Employee attrition is critical to the wellbeing of a company. High employee turnover disrupts workflow and production and consumes scarce resources such as time and labour. Furthermore, it costs on average $1,200 to recruit and train a new employee. It is therefore in a company's best interest to maintain low employee turnover and to examine employees' motives for staying or leaving.
Business Understanding
The CRISP-DM model was used to analyse this dataset. The first step of the CRISP-DM model is to establish a clear Business Understanding, which requires the researcher to understand the business objectives and goals, as well as what constitutes business success and failure. It is also vital to define what would make this analysis a success and what the success criteria are.
Data Preparation
The next step in the CRISP-DM model is Data Preparation, in which the dataset is cleaned and treated and each column of the data is understood. It is vital to conduct data exploration during this stage by computing descriptive statistics, such as the mean and standard deviation of numeric variables, and by creating graphs.
Data Modelling
After data preparation, data modelling is conducted. If the research is supervised, the researchers must know their target (dependent) variable prior to data mining. This allows them to identify appropriate machine learning (ML) techniques for the dataset. Identifying the specific assumptions of each modelling technique is vital at this stage.
Evaluation
As the models are developed, they need to be evaluated. The researchers should move fluidly between the modelling and evaluation stages, because the models can be improved upon once evaluated.
Deployment
Once the model is developed, it is important to organise and present the knowledge and insight gained in a manner the customer can use. Regular maintenance has to be scheduled so that the model stays up to date with the new data being imported into it, ensuring its accuracy.
Since the start of the pandemic there has been a labour shortage, which has been coined the “Great Resignation” (Miel, 2021); even individuals in the executive ranks of companies have been changing positions (Walsh, 2021). Several factors drive employees to leave, such as a lack of promotional opportunities and the inability to work from home.
The cost of a new hire was examined by Indeed in 2021, which found that it costs on average $1,200 and involves different types of training, such as instructor-led training, online learning programmes, mentoring and hands-on learning.
Several hidden costs of training new employees have been identified as:
For several years, employee turnover was examined mainly through the lens of the business and its stakeholders. Businesses aimed to reduce turnover because of its costs; however, the reasons for employee turnover were left unexamined and misunderstood. In the current workplace environment and the post-Covid ‘Great Resignation’, there is a shift in businesses’ understanding of what motivates and retains great employees, as well as a focus on why great employees leave.
There are several business objectives that are important to the business.
The project will be deemed successful if the model achieves 80% accuracy, with the aim of reducing cases where the model predicts that an employee stays when in reality the employee leaves. The model's ability to characterise employees who stay can help to examine how they differ from employees who left, and perhaps to change the working environment for potential leavers.
IT Carlow Library has a vast number of books on machine learning algorithms, as well as excellent lecturers who can answer difficult questions about machine learning. No other individual will be working with me, as this is an individual assignment for this module. There will be no updates to the data during this project, so the data gathered at the start is the same data at the end of the project.
Several assumptions are made during this project, such as that the data is verified during the data mining process and that the data is clean. There are no missing values for categorical variables, while a zero in a numeric variable is assumed to be accurate.
There were several constraints on this project. The link to the data dictionary was not available, so it was difficult to understand what some variables were. Another issue was the size of the dataset: by most standards it was considered rather small, with only 1,470 observations.
No major risks are expected to delay this project. As with any project, however, it is possible that there are not enough data observations to lead to the best data mining processes, great insights and precise deployment.
There is some important terminology for this project which will be highlighted in this section.
This terminology is commonly used in the evaluation of a model via a confusion matrix, which will be generated for each model and evaluated using recall, precision and accuracy.
Important evaluation terminology:
The benefit of this project is the identification of features that could help identify employees who are likely to leave. This could lead to a (costly) intervention to retain those employees; however, it could also lead to a better workplace environment and a better work culture.
The data mining goal is to predict which employees are more likely to leave the company than to stay.
Three different models will be compared to assess which has the best ability to classify the minority class accurately: Decision Tree, Random Forest and Logistic Regression.
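The comparison of the three classifiers might be sketched as below. This is a minimal, illustrative example using a synthetic imbalanced dataset as a stand-in for the HR data; the variable names and the choice of recall as the comparison metric are assumptions for illustration, not the project's final evaluation procedure.

```python
# Minimal sketch: comparing three classifiers with cross-validated recall
# on a synthetic imbalanced dataset (illustrative stand-in for the HR data).
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# ~16% minority class, roughly matching the attrition rate in the dataset
X, y = make_classification(n_samples=500, weights=[0.84], random_state=42)

models = {
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Random Forest': RandomForestClassifier(random_state=42),
    'Logistic Regression': LogisticRegression(max_iter=1000),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring='recall')
    print(f'{name}: mean recall = {scores.mean():.3f}')
```

With the real data, `X` and `y` would come from the prepared DataFrame, and class imbalance could additionally be addressed with SMOTE (imported above) before fitting.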
Although reducing both false negatives and false positives is important, it was decided to focus on false positives (the model predicting that an employee would stay when in reality they left the company), because in such cases the model fails to predict staff turnover, which is exactly the opposite of the desired result.
By focusing on reducing false positives, it is very possible that false negatives will rise. Although this may seem counter-intuitive, it would be highly useful for this project. Measures taken to maximise true negatives and reduce false positives, such as a reduction in overtime, a salary raise and a hybrid working model, would result in fewer employees leaving the company. Individuals categorised as false negatives would also benefit from these perks, which could lead to greater job and environment satisfaction. Although this may be viewed as an expensive intervention, HR research and LinkedIn have stated that it is very important for a company to keep employees' well-being in mind, and that this should even be part of the company's core values. HR managers at competitor companies continue to attract employees who are unhappy with overwork, the lack of a hybrid working model and poor wages in their current jobs.
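The confusion matrix quadrants discussed above can be extracted as follows. This is a toy example with made-up labels, assuming 1 = 'stays' is treated as the positive class (matching the report's definition of a false positive as an employee predicted to stay who actually left); it is not the project's actual model output.

```python
# Toy example: extracting TN/FP/FN/TP from a confusion matrix.
# Assumption for illustration: 1 = stayed (positive class), 0 = left.
from sklearn.metrics import confusion_matrix, precision_score

y_true = [1, 1, 0, 1, 0, 0, 1, 0, 1, 1]  # made-up actual outcomes
y_pred = [1, 0, 0, 1, 1, 0, 1, 0, 1, 1]  # made-up model predictions

# confusion_matrix orders labels [0, 1], so ravel() gives tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f'TP={tp} FP={fp} FN={fn} TN={tn}')
print(f'Precision = {precision_score(y_true, y_pred):.2f}')
```

Precision (TP / (TP + FP)) falls as false positives rise, which is why it is used later as the success criterion for reducing them.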
The AUC value (Area Under the Curve) and ROC curve will also be used to evaluate the performance and prediction functions for each model.
The Area Under the Curve (AUC) is the measure of the ability of a classifier to distinguish between classes and is used as a summary of the ROC curve.
The closer the AUC is to 1, the better the classifier is at distinguishing positive class values from negative class values, because it detects more true positives and true negatives than false negatives and false positives; an AUC of 0.5 corresponds to random guessing.
The receiver operating characteristic (ROC) curve is a graph where the false positive rate is placed on the X-axis and the true positive rate is placed on the Y-axis. The ROC curves are useful to visualize and compare the performance of classifier methods.
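Computing the AUC and the ROC points might look like the sketch below. The scores here are toy predicted probabilities, not the project's model output.

```python
# Sketch: AUC and ROC points from a classifier's predicted probabilities
# (toy scores for illustration only).
from sklearn.metrics import roc_auc_score, roc_curve

y_true = [0, 0, 1, 1, 0, 1, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.9, 0.3]

auc = roc_auc_score(y_true, y_score)
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(f'AUC = {auc:.2f}')
# fpr (x-axis) and tpr (y-axis) trace the ROC curve; plotting them with
# matplotlib gives the curve described above.
```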
A model will be deemed successful when its precision is at least 70%, which would mean that the false positives have been successfully reduced.
A project plan was created to ensure a steady workload and completion by the deadline of 20th April 2022. The plan was devised in a manner that would allow plenty of time for revisions in case a re-evaluation of techniques was decided upon.
When the project deadline was extended, the plan was not revised because the work was proceeding according to the original plan.
Python 3.9 was chosen to prepare and visualise data, as well as build and evaluate models. This will be conducted in Jupyter Notebook.
Tableau was chosen to create in-depth visualisations.
Several research questions were devised to explore data:
The dataset consists of past and current employees in a spreadsheet. The dataset was downloaded from https://www.kaggle.com/patelprashant/employee-attrition
According to the Kaggle description, the dataset was previously available from IBM; however, it has since been taken down. https://www.ibm.com/communities/analytics/watson-analytics-blog/watson-analytics-use-case-for-hr-retaining-valuable-employees/
# importing required packages
import numpy as np
import pandas as pd
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import statsmodels.api as sm
import statsmodels.formula.api as smf
import patsy
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn import tree, metrics
from sklearn.tree import DecisionTreeClassifier, plot_tree  # plot_tree draws the tree diagram
from sklearn.ensemble import (RandomForestClassifier, BaggingClassifier,
                              AdaBoostClassifier, GradientBoostingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             roc_auc_score, explained_variance_score,
                             r2_score, PrecisionRecallDisplay)
from sklearn.metrics import mean_squared_error as MSE
from imblearn.over_sampling import SMOTE  # imblearn library can be installed using pip install imblearn
# # Connect to Google Drive
# # mount google drive to the virtual machine to use files from within here
# from google.colab import drive
# drive.mount('/content/drive')
# # reading in the csv
# df = pd.read_csv('/content/drive/My Drive/Colab Notebooks/Data Algorithm/HR-Employee-Attrition.csv')
# Importing data in Jupyter Notebook
# read the CSV file
df = pd.read_csv('HR-Employee-Attrition.csv')
# checking the df shape
print('The df has {} rows and {} columns.'.format(*df.shape))
The df has 1470 rows and 35 columns.
# prints out the first row for the dataset
print(df.head(1).transpose())
                                       0
Age                                   41
Attrition                            Yes
BusinessTravel             Travel_Rarely
DailyRate                           1102
Department                         Sales
DistanceFromHome                       1
Education                              2
EducationField             Life Sciences
EmployeeCount                          1
EmployeeNumber                         1
EnvironmentSatisfaction                2
Gender                            Female
HourlyRate                            94
JobInvolvement                         3
JobLevel                               2
JobRole                  Sales Executive
JobSatisfaction                        4
MaritalStatus                     Single
MonthlyIncome                       5993
MonthlyRate                        19479
NumCompaniesWorked                     8
Over18                                 Y
OverTime                             Yes
PercentSalaryHike                     11
PerformanceRating                      3
RelationshipSatisfaction               1
StandardHours                         80
StockOptionLevel                       0
TotalWorkingYears                      8
TrainingTimesLastYear                  0
WorkLifeBalance                        1
YearsAtCompany                         6
YearsInCurrentRole                     4
YearsSinceLastPromotion                0
YearsWithCurrManager                   5
# prints out names of columns
print(df.columns)
Index(['Age', 'Attrition', 'BusinessTravel', 'DailyRate', 'Department',
'DistanceFromHome', 'Education', 'EducationField', 'EmployeeCount',
'EmployeeNumber', 'EnvironmentSatisfaction', 'Gender', 'HourlyRate',
'JobInvolvement', 'JobLevel', 'JobRole', 'JobSatisfaction',
'MaritalStatus', 'MonthlyIncome', 'MonthlyRate', 'NumCompaniesWorked',
'Over18', 'OverTime', 'PercentSalaryHike', 'PerformanceRating',
'RelationshipSatisfaction', 'StandardHours', 'StockOptionLevel',
'TotalWorkingYears', 'TrainingTimesLastYear', 'WorkLifeBalance',
'YearsAtCompany', 'YearsInCurrentRole', 'YearsSinceLastPromotion',
'YearsWithCurrManager'],
dtype='object')
# This tells us which variables are object, int64 and float64. Some of the object
# variables might have to be changed into categorical variables, and int64 to float64,
# depending on our analysis.
print(df.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 35 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   Age                       1470 non-null   int64
 1   Attrition                 1470 non-null   object
 2   BusinessTravel            1470 non-null   object
 3   DailyRate                 1470 non-null   int64
 4   Department                1470 non-null   object
 5   DistanceFromHome          1470 non-null   int64
 6   Education                 1470 non-null   int64
 7   EducationField            1470 non-null   object
 8   EmployeeCount             1470 non-null   int64
 9   EmployeeNumber            1470 non-null   int64
 10  EnvironmentSatisfaction   1470 non-null   int64
 11  Gender                    1470 non-null   object
 12  HourlyRate                1470 non-null   int64
 13  JobInvolvement            1470 non-null   int64
 14  JobLevel                  1470 non-null   int64
 15  JobRole                   1470 non-null   object
 16  JobSatisfaction           1470 non-null   int64
 17  MaritalStatus             1470 non-null   object
 18  MonthlyIncome             1470 non-null   int64
 19  MonthlyRate               1470 non-null   int64
 20  NumCompaniesWorked        1470 non-null   int64
 21  Over18                    1470 non-null   object
 22  OverTime                  1470 non-null   object
 23  PercentSalaryHike         1470 non-null   int64
 24  PerformanceRating         1470 non-null   int64
 25  RelationshipSatisfaction  1470 non-null   int64
 26  StandardHours             1470 non-null   int64
 27  StockOptionLevel          1470 non-null   int64
 28  TotalWorkingYears         1470 non-null   int64
 29  TrainingTimesLastYear     1470 non-null   int64
 30  WorkLifeBalance           1470 non-null   int64
 31  YearsAtCompany            1470 non-null   int64
 32  YearsInCurrentRole        1470 non-null   int64
 33  YearsSinceLastPromotion   1470 non-null   int64
 34  YearsWithCurrManager      1470 non-null   int64
dtypes: int64(26), object(9)
memory usage: 402.1+ KB
None
# checking for missing data
print('Nan in each columns' , df.isna().sum(), sep='\n')
# no missing data
Nan in each columns
Age                         0
Attrition                   0
BusinessTravel              0
DailyRate                   0
Department                  0
DistanceFromHome            0
Education                   0
EducationField              0
EmployeeCount               0
EmployeeNumber              0
EnvironmentSatisfaction     0
Gender                      0
HourlyRate                  0
JobInvolvement              0
JobLevel                    0
JobRole                     0
JobSatisfaction             0
MaritalStatus               0
MonthlyIncome               0
MonthlyRate                 0
NumCompaniesWorked          0
Over18                      0
OverTime                    0
PercentSalaryHike           0
PerformanceRating           0
RelationshipSatisfaction    0
StandardHours               0
StockOptionLevel            0
TotalWorkingYears           0
TrainingTimesLastYear       0
WorkLifeBalance             0
YearsAtCompany              0
YearsInCurrentRole          0
YearsSinceLastPromotion     0
YearsWithCurrManager        0
dtype: int64
# Looking for all the unique values in all the columns
column = df.columns
for i in column:
print('\n',i,'\n',df[i].unique(),'\n','-'*80)
Age [41 49 37 33 27 32 59 30 38 36 35 29 31 34 28 22 53 24 21 42 44 46 39 43 50 26 48 55 45 56 23 51 40 54 58 20 25 19 57 52 47 18 60] -------------------------------------------------------------------------------- Attrition ['Yes' 'No'] -------------------------------------------------------------------------------- BusinessTravel ['Travel_Rarely' 'Travel_Frequently' 'Non-Travel'] -------------------------------------------------------------------------------- DailyRate [1102 279 1373 1392 591 1005 1324 1358 216 1299 809 153 670 1346 103 1389 334 1123 1219 371 673 1218 419 391 699 1282 1125 691 477 705 924 1459 125 895 813 1273 869 890 852 1141 464 1240 1357 994 721 1360 1065 408 1211 1229 626 1434 1488 1097 1443 515 853 1142 655 1115 427 653 989 1435 1223 836 1195 1339 664 318 1225 1328 1082 548 132 746 776 193 397 945 1214 111 573 1153 1400 541 432 288 669 530 632 1334 638 1093 1217 1353 120 682 489 807 827 871 665 1040 1420 240 1280 534 1456 658 142 1127 1031 1189 1354 1467 922 394 1312 750 441 684 249 841 147 528 594 470 957 542 802 1355 1150 1329 959 1033 1316 364 438 689 201 1427 857 933 1181 1395 662 1436 194 967 1496 1169 1145 630 303 1256 440 1450 1452 465 702 1157 602 1480 1268 713 134 526 1380 140 629 1356 328 1084 931 692 1069 313 894 556 1344 290 138 926 1261 472 1002 878 905 1180 121 1136 635 1151 644 1045 829 1242 1469 896 992 1052 1147 1396 663 119 979 319 1413 944 1323 532 818 854 1034 771 1401 1431 976 1411 1300 252 1327 832 1017 1199 504 505 916 1247 685 269 1416 833 307 1311 128 488 529 1210 1463 675 1385 1403 452 666 1158 228 996 728 1315 322 1479 797 1070 442 496 1372 920 688 1449 1117 636 506 444 950 889 555 230 1232 566 1302 812 1476 218 1132 1105 906 849 390 106 1249 192 553 117 185 1091 723 1220 588 1377 1018 1275 798 672 1162 508 1482 559 210 928 1001 549 1124 738 570 1130 1192 343 144 1296 1309 483 810 544 1062 1319 641 1332 756 845 593 1171 350 921 1144 143 1046 575 156 1283 755 304 1178 329 1362 1371 202 253 164 1107 759 1305 982 821 
1381 480 1473 891 1063 645 1490 317 422 1485 1368 1448 296 1398 1349 986 1099 1116 1499 983 1009 1303 1274 1277 587 413 1276 988 1474 163 267 619 302 443 828 561 426 232 1306 1094 509 775 195 258 471 799 956 535 1495 446 1245 703 823 1246 622 1287 448 254 1365 538 525 558 782 362 1236 1112 204 1343 604 1216 646 160 238 1397 306 991 482 1176 913 1076 727 885 243 806 817 1410 1207 1442 693 929 562 608 580 970 1179 294 314 316 654 168 381 217 501 650 141 804 975 1090 346 430 268 167 621 527 883 954 310 719 725 715 657 1146 182 376 571 384 791 1111 1243 1092 1325 805 213 118 676 1252 286 1258 932 1041 859 720 946 1184 436 589 760 887 1318 625 180 586 1012 661 930 342 1230 1271 1278 607 130 300 583 1418 1269 379 395 1265 1222 341 868 1231 102 881 1383 1075 374 1086 781 177 500 1425 1454 617 1085 995 1122 618 546 462 1198 1272 154 1137 1188 188 1333 867 263 938 129 616 498 1404 1053 289 1376 231 152 882 903 1379 335 722 461 974 1126 840 1134 248 955 939 1391 1206 287 1441 109 1066 277 466 1055 265 135 247 1035 266 145 1038 1234 1109 1089 788 124 660 1186 1464 796 415 769 1003 1366 330 1492 1204 309 1330 469 697 1262 1050 770 406 203 1308 984 439 793 1451 1182 174 490 718 433 773 603 874 367 199 481 647 1384 902 819 862 1457 977 942 1402 1421 1361 917 200 150 179 696 116 363 107 1465 458 1212 1103 966 1010 326 1098 969 1167 694 1320 536 373 599 251 131 237 1429 648 735 531 429 968 879 640 412 848 360 1138 325 1322 299 1030 634 524 256 1060 935 495 282 206 943 523 507 601 855 1291 1405 1369 999 1202 285 404 736 1498 1200 1439 499 205 683 1462 949 652 332 1475 337 971 1174 667 560 172 383 1255 359 401 377 592 1445 1221 866 981 447 1326 748 990 405 115 790 830 1193 1423 467 271 410 1083 516 224 136 1029 333 1440 674 1342 898 824 492 598 740 888 1288 104 1108 479 1351 474 437 884 1370 264 1059 563 457 1313 241 1015 336 1387 170 208 671 711 737 1470 365 763 567 486 772 301 311 584 880 392 148 708 1259 786 370 678 146 581 918 1238 585 741 552 369 717 543 964 792 611 176 897 600 
1054 428 181 211 1079 590 305 953 478 1375 244 511 1294 196 734 1239 1253 1128 1336 234 766 261 1194 431 572 1422 1297 574 355 207 706 280 726 414 352 1224 459 1254 1131 835 1172 1266 783 219 1213 1096 1251 1394 605 1064 1337 937 157 754 1168 155 1444 189 911 1321 1154 557 642 801 161 1382 1037 105 582 704 345 1120 1378 468 613 1023 628] -------------------------------------------------------------------------------- Department ['Sales' 'Research & Development' 'Human Resources'] -------------------------------------------------------------------------------- DistanceFromHome [ 1 8 2 3 24 23 27 16 15 26 19 21 5 11 9 7 6 10 4 25 12 18 29 22 14 20 28 17 13] -------------------------------------------------------------------------------- Education [2 1 4 3 5] -------------------------------------------------------------------------------- EducationField ['Life Sciences' 'Other' 'Medical' 'Marketing' 'Technical Degree' 'Human Resources'] -------------------------------------------------------------------------------- EmployeeCount [1] -------------------------------------------------------------------------------- EmployeeNumber [ 1 2 4 ... 
2064 2065 2068] -------------------------------------------------------------------------------- EnvironmentSatisfaction [2 3 4 1] -------------------------------------------------------------------------------- Gender ['Female' 'Male'] -------------------------------------------------------------------------------- HourlyRate [ 94 61 92 56 40 79 81 67 44 84 49 31 93 50 51 80 96 78 45 82 53 83 58 72 48 42 41 86 97 75 33 37 73 98 36 47 71 30 43 99 59 95 57 76 87 66 55 32 52 70 62 64 63 60 100 46 39 77 35 91 54 34 90 65 88 85 89 68 69 74 38] -------------------------------------------------------------------------------- JobInvolvement [3 2 4 1] -------------------------------------------------------------------------------- JobLevel [2 1 3 4 5] -------------------------------------------------------------------------------- JobRole ['Sales Executive' 'Research Scientist' 'Laboratory Technician' 'Manufacturing Director' 'Healthcare Representative' 'Manager' 'Sales Representative' 'Research Director' 'Human Resources'] -------------------------------------------------------------------------------- JobSatisfaction [4 2 3 1] -------------------------------------------------------------------------------- MaritalStatus ['Single' 'Married' 'Divorced'] -------------------------------------------------------------------------------- MonthlyIncome [5993 5130 2090 ... 9991 5390 4404] -------------------------------------------------------------------------------- MonthlyRate [19479 24907 2396 ... 
5174 13243 10228] -------------------------------------------------------------------------------- NumCompaniesWorked [8 1 6 9 0 4 5 2 7 3] -------------------------------------------------------------------------------- Over18 ['Y'] -------------------------------------------------------------------------------- OverTime ['Yes' 'No'] -------------------------------------------------------------------------------- PercentSalaryHike [11 23 15 12 13 20 22 21 17 14 16 18 19 24 25] -------------------------------------------------------------------------------- PerformanceRating [3 4] -------------------------------------------------------------------------------- RelationshipSatisfaction [1 4 2 3] -------------------------------------------------------------------------------- StandardHours [80] -------------------------------------------------------------------------------- StockOptionLevel [0 1 3 2] -------------------------------------------------------------------------------- TotalWorkingYears [ 8 10 7 6 12 1 17 5 3 31 13 0 26 24 22 9 19 2 23 14 15 4 29 28 21 25 20 11 16 37 38 30 40 18 36 34 32 33 35 27] -------------------------------------------------------------------------------- TrainingTimesLastYear [0 3 2 5 1 4 6] -------------------------------------------------------------------------------- WorkLifeBalance [1 3 2 4] -------------------------------------------------------------------------------- YearsAtCompany [ 6 10 0 8 2 7 1 9 5 4 25 3 12 14 22 15 27 21 17 11 13 37 16 20 40 24 33 19 36 18 29 31 32 34 26 30 23] -------------------------------------------------------------------------------- YearsInCurrentRole [ 4 7 0 2 5 9 8 3 6 13 1 15 14 16 11 10 12 18 17] -------------------------------------------------------------------------------- YearsSinceLastPromotion [ 0 1 3 2 7 4 8 6 5 15 9 13 12 10 11 14] -------------------------------------------------------------------------------- YearsWithCurrManager [ 5 7 0 2 6 8 3 11 17 1 4 12 9 10 15 13 16 14] 
--------------------------------------------------------------------------------
There are no missing values in our dataset.
| Attribute | DataType | Description |
|---|---|---|
| Age | int | Age of an employee: range 18 to 60 |
| BusinessTravel | text | Travel for work: Travel_Rarely, Travel_Frequently, Non-Travel |
| DailyRate | int | Daily rate of employee salary |
| Department | text | The department that the employee worked in: Sales, Research & Development, Human Resources |
| DistanceFromHome | int | Number of miles away from home: range 1 to 29 |
| Education | int | The education level reached by the employee: 1 'Below College' 2 'College' 3 'Bachelor' 4 'Master' 5 'Doctor' |
| EducationField | text | The area of study: Life Sciences, Medical, Marketing, Technical Degree, Human Resources, Other |
| EmployeeCount | int | Unclear from data dictionary |
| EmployeeNumber | int | Employee number of the employee in the dataset |
| EnvironmentSatisfaction | int | 1 'Low', 2 'Medium', 3 'High', 4 'Very High' |
| Gender | text | Gender of employee: Male or Female |
| HourlyRate | int | Hourly Rate of Employee |
| JobInvolvement | int | 1 'Low', 2 'Medium', 3 'High', 4 'Very High' |
| JobLevel | int | 1 'Low', 2 'Medium', 3 'High', 4 'Very High' |
| JobRole | text | Employee's job title: Sales Executive, Research Scientist, Laboratory Technician, Manufacturing Director, Healthcare Representative, Manager, Sales Representative, Research Director, Human Resources |
| JobSatisfaction | int | 1 'Low', 2 'Medium', 3 'High', 4 'Very High' |
| MaritalStatus | text | Employee's marital status: Single, Married, Divorced |
| MonthlyIncome | int | Employee's monthly salary |
| MonthlyRate | int | Unclear from data dictionary |
| NumCompaniesWorked | int | Number of companies previously worked for |
| Over18 | text | Employee's Over-18 status: Y |
| OverTime | text | Employee's overtime status: Yes, No |
| PercentSalaryHike | int | Percent of Salary Hike |
| PerformanceRating | int | 1 'Low', 2 'Good', 3 'Excellent', 4 'Outstanding' |
| RelationshipSatisfaction | int | 1 'Low', 2 'Good', 3 'Excellent', 4 'Outstanding' |
| StandardHours | int | Unclear from data dictionary |
| StockOptionLevel | int | Unclear from data dictionary |
| TotalWorkingYears | int | Number of total working years |
| TrainingTimesLastYear | int | Number of training times last year |
| WorkLifeBalance | int | 1 'Bad', 2 'Good', 3 'Better', 4 'Best' |
| YearsAtCompany | int | Number of total working years in the company |
| YearsInCurrentRole | int | Number of total working years in the current role |
| YearsSinceLastPromotion | int | Number of total working years since last promotion |
| YearsWithCurrManager | int | Number of total working years with current manager |
Please note: DistanceFromHome was assumed to be in miles because the data is from IBM, which is an American company. The unit was not provided in the data dictionary.
| Attribute | DataType | Description |
|---|---|---|
| Attrition | text | Did the employee leave or not: Yes or No? |
Before the data was modelled, it was processed and prepared for modelling. This included changing relevant variables to categorical types and reordering category levels where required. After some deliberation, several variables were dropped.
# converting target variable into a categorical variable
# Target Variable
df['Attrition'] = df['Attrition'].astype('category')
The column 'MaritalStatus' was converted to a categorical variable and its categories were reordered; this becomes important in graphs.
# converting variable into a categorical variable
# MaritialStatus
df['MaritalStatus'] = df['MaritalStatus'].astype('category')
#reordering - important in graphs
df['MaritalStatus'] = df['MaritalStatus'].cat.reorder_categories(['Single', 'Married', 'Divorced'],
ordered=True)
#examining unique values
df['MaritalStatus'].unique()
['Single', 'Married', 'Divorced'] Categories (3, object): ['Single' < 'Married' < 'Divorced']
Numerical values in the column 'Education' were changed to their corresponding category labels; the column was then converted to a categorical variable and its categories were reordered, which becomes important in graphs.
# Changing the numbers to the appropriate category labels
df['Education'] = df['Education'].replace({1: 'Below College',
                                           2: 'College',
                                           3: 'Bachelor',
                                           4: 'Master',
                                           5: 'Doctor'})
# converting variable into a categorical variable
df['Education'] = df['Education'].astype('category')
# #reordering - important in graphs
df['Education'] = df['Education'].cat.reorder_categories(['Below College', 'College', 'Bachelor', 'Master', 'Doctor'],
ordered=True)
#examining unique values
df['Education'].unique()
['College', 'Below College', 'Master', 'Bachelor', 'Doctor'] Categories (5, object): ['Below College' < 'College' < 'Bachelor' < 'Master' < 'Doctor']
# checking the data to confirm which columns have been converted to categorical variables
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 35 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   Age                       1470 non-null   int64
 1   Attrition                 1470 non-null   category
 2   BusinessTravel            1470 non-null   object
 3   DailyRate                 1470 non-null   int64
 4   Department                1470 non-null   object
 5   DistanceFromHome          1470 non-null   int64
 6   Education                 1470 non-null   category
 7   EducationField            1470 non-null   object
 8   EmployeeCount             1470 non-null   int64
 9   EmployeeNumber            1470 non-null   int64
 10  EnvironmentSatisfaction   1470 non-null   int64
 11  Gender                    1470 non-null   object
 12  HourlyRate                1470 non-null   int64
 13  JobInvolvement            1470 non-null   int64
 14  JobLevel                  1470 non-null   int64
 15  JobRole                   1470 non-null   object
 16  JobSatisfaction           1470 non-null   int64
 17  MaritalStatus             1470 non-null   category
 18  MonthlyIncome             1470 non-null   int64
 19  MonthlyRate               1470 non-null   int64
 20  NumCompaniesWorked        1470 non-null   int64
 21  Over18                    1470 non-null   object
 22  OverTime                  1470 non-null   object
 23  PercentSalaryHike         1470 non-null   int64
 24  PerformanceRating         1470 non-null   int64
 25  RelationshipSatisfaction  1470 non-null   int64
 26  StandardHours             1470 non-null   int64
 27  StockOptionLevel          1470 non-null   int64
 28  TotalWorkingYears         1470 non-null   int64
 29  TrainingTimesLastYear     1470 non-null   int64
 30  WorkLifeBalance           1470 non-null   int64
 31  YearsAtCompany            1470 non-null   int64
 32  YearsInCurrentRole        1470 non-null   int64
 33  YearsSinceLastPromotion   1470 non-null   int64
 34  YearsWithCurrManager      1470 non-null   int64
dtypes: category(3), int64(25), object(7)
memory usage: 372.4+ KB
# checking for counts data and gives Mean, Sd and quartiles for all columns
print(df.describe().transpose())
count mean std min 25% \
Age 1470.0 36.923810 9.135373 18.0 30.00
DailyRate 1470.0 802.485714 403.509100 102.0 465.00
DistanceFromHome 1470.0 9.192517 8.106864 1.0 2.00
EmployeeCount 1470.0 1.000000 0.000000 1.0 1.00
EmployeeNumber 1470.0 1024.865306 602.024335 1.0 491.25
EnvironmentSatisfaction 1470.0 2.721769 1.093082 1.0 2.00
HourlyRate 1470.0 65.891156 20.329428 30.0 48.00
JobInvolvement 1470.0 2.729932 0.711561 1.0 2.00
JobLevel 1470.0 2.063946 1.106940 1.0 1.00
JobSatisfaction 1470.0 2.728571 1.102846 1.0 2.00
MonthlyIncome 1470.0 6502.931293 4707.956783 1009.0 2911.00
MonthlyRate 1470.0 14313.103401 7117.786044 2094.0 8047.00
NumCompaniesWorked 1470.0 2.693197 2.498009 0.0 1.00
PercentSalaryHike 1470.0 15.209524 3.659938 11.0 12.00
PerformanceRating 1470.0 3.153741 0.360824 3.0 3.00
RelationshipSatisfaction 1470.0 2.712245 1.081209 1.0 2.00
StandardHours 1470.0 80.000000 0.000000 80.0 80.00
StockOptionLevel 1470.0 0.793878 0.852077 0.0 0.00
TotalWorkingYears 1470.0 11.279592 7.780782 0.0 6.00
TrainingTimesLastYear 1470.0 2.799320 1.289271 0.0 2.00
WorkLifeBalance 1470.0 2.761224 0.706476 1.0 2.00
YearsAtCompany 1470.0 7.008163 6.126525 0.0 3.00
YearsInCurrentRole 1470.0 4.229252 3.623137 0.0 2.00
YearsSinceLastPromotion 1470.0 2.187755 3.222430 0.0 0.00
YearsWithCurrManager 1470.0 4.123129 3.568136 0.0 2.00
50% 75% max
Age 36.0 43.00 60.0
DailyRate 802.0 1157.00 1499.0
DistanceFromHome 7.0 14.00 29.0
EmployeeCount 1.0 1.00 1.0
EmployeeNumber 1020.5 1555.75 2068.0
EnvironmentSatisfaction 3.0 4.00 4.0
HourlyRate 66.0 83.75 100.0
JobInvolvement 3.0 3.00 4.0
JobLevel 2.0 3.00 5.0
JobSatisfaction 3.0 4.00 4.0
MonthlyIncome 4919.0 8379.00 19999.0
MonthlyRate 14235.5 20461.50 26999.0
NumCompaniesWorked 2.0 4.00 9.0
PercentSalaryHike 14.0 18.00 25.0
PerformanceRating 3.0 3.00 4.0
RelationshipSatisfaction 3.0 4.00 4.0
StandardHours 80.0 80.00 80.0
StockOptionLevel 1.0 1.00 3.0
TotalWorkingYears 10.0 15.00 40.0
TrainingTimesLastYear 3.0 3.00 6.0
WorkLifeBalance 3.0 3.00 4.0
YearsAtCompany 5.0 9.00 40.0
YearsInCurrentRole 3.0 7.00 18.0
YearsSinceLastPromotion 1.0 3.00 15.0
YearsWithCurrManager 3.0 7.00 17.0
The following columns were flagged as redundant due to a lack of variety in their values.
# EmployeeCount has no column description and all values are '1'
df.groupby(['EmployeeCount']).size().sort_values(ascending=False)
EmployeeCount
1    1470
dtype: int64
# All employees are over 18
df.groupby(['Over18']).size().sort_values(ascending=False)
Over18
Y    1470
dtype: int64
# StockOptionLevel has no column description
df.groupby(['StockOptionLevel']).size().sort_values(ascending=False)
StockOptionLevel
0    631
1    596
2    158
3     85
dtype: int64
# PerformanceRating has only 2 values, which are very similar
df.groupby(['PerformanceRating']).size().sort_values(ascending=False)
PerformanceRating
3    1244
4     226
dtype: int64
# StandardHours has no column description and all values are '80'
df.groupby(['StandardHours']).size().sort_values(ascending=False)
StandardHours
80    1470
dtype: int64
df = df.drop(['EmployeeCount','EmployeeNumber', 'Over18', 'JobLevel',
'StockOptionLevel', 'PerformanceRating', 'StandardHours', 'MonthlyRate', 'HourlyRate', 'DailyRate'], axis=1)
Several columns were unclear from the initial data dictionary or were too granular. The table below lists the dropped columns and the reasons for dropping them.
| Attribute | Reason |
|---|---|
| EmployeeCount | No data description was provided and all values are '1' |
| JobLevel | No data description was provided |
| EmployeeNumber | Redundant - in pandas index is used |
| Over18 | All employees are/were over 18 |
| StockOptionLevel | No data description was provided |
| PerformanceRating | Only two values: Excellent and Outstanding. Unclear what is the difference between the two |
| MonthlyRate | Too granular |
| HourlyRate | Too granular |
| DailyRate | Too granular |
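Columns like EmployeeCount and StandardHours can also be detected programmatically rather than checked one `groupby` at a time. A minimal sketch, using a small stand-in frame so it runs on its own (in the notebook the loaded `df` would be used):

```python
import pandas as pd

# Small stand-in frame; in the notebook this check would run on the loaded df
df = pd.DataFrame({
    'EmployeeCount': [1, 1, 1, 1],
    'Over18': ['Y', 'Y', 'Y', 'Y'],
    'Age': [41, 49, 37, 33],
})

# A column with a single unique value carries no information for modelling
constant_cols = [c for c in df.columns if df[c].nunique() == 1]
print(constant_cols)  # ['EmployeeCount', 'Over18']
```

Such columns are safe to drop up front, since they cannot help discriminate between stayers and leavers.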
# Looking for all the unique values in all the columns
column = df.columns
for i in column:
    print('\n', i, '\n', df[i].unique(), '\n', '-'*80)
 Age 
 [41 49 37 33 27 32 59 30 38 36 35 29 31 34 28 22 53 24 21 42 44 46 39 43 50 26 48 55 45 56 23 51 40 54 58 20 25 19 57 52 47 18 60] 
 --------------------------------------------------------------------------------
 Attrition 
 ['Yes', 'No'] Categories (2, object): ['No', 'Yes'] 
 --------------------------------------------------------------------------------
 BusinessTravel 
 ['Travel_Rarely' 'Travel_Frequently' 'Non-Travel'] 
 --------------------------------------------------------------------------------
 Department 
 ['Sales' 'Research & Development' 'Human Resources'] 
 --------------------------------------------------------------------------------
 DistanceFromHome 
 [ 1  8  2  3 24 23 27 16 15 26 19 21  5 11  9  7  6 10  4 25 12 18 29 22 14 20 28 17 13] 
 --------------------------------------------------------------------------------
 Education 
 ['College', 'Below College', 'Master', 'Bachelor', 'Doctor'] Categories (5, object): ['Below College' < 'College' < 'Bachelor' < 'Master' < 'Doctor'] 
 --------------------------------------------------------------------------------
 EducationField 
 ['Life Sciences' 'Other' 'Medical' 'Marketing' 'Technical Degree' 'Human Resources'] 
 --------------------------------------------------------------------------------
 EnvironmentSatisfaction 
 [2 3 4 1] 
 --------------------------------------------------------------------------------
 Gender 
 ['Female' 'Male'] 
 --------------------------------------------------------------------------------
 JobInvolvement 
 [3 2 4 1] 
 --------------------------------------------------------------------------------
 JobRole 
 ['Sales Executive' 'Research Scientist' 'Laboratory Technician' 'Manufacturing Director' 'Healthcare Representative' 'Manager' 'Sales Representative' 'Research Director' 'Human Resources'] 
 --------------------------------------------------------------------------------
 JobSatisfaction 
 [4 2 3 1] 
 --------------------------------------------------------------------------------
 MaritalStatus 
 ['Single', 'Married', 'Divorced'] Categories (3, object): ['Single' < 'Married' < 'Divorced'] 
 --------------------------------------------------------------------------------
 MonthlyIncome 
 [5993 5130 2090 ... 9991 5390 4404] 
 --------------------------------------------------------------------------------
 NumCompaniesWorked 
 [8 1 6 9 0 4 5 2 7 3] 
 --------------------------------------------------------------------------------
 OverTime 
 ['Yes' 'No'] 
 --------------------------------------------------------------------------------
 PercentSalaryHike 
 [11 23 15 12 13 20 22 21 17 14 16 18 19 24 25] 
 --------------------------------------------------------------------------------
 RelationshipSatisfaction 
 [1 4 2 3] 
 --------------------------------------------------------------------------------
 TotalWorkingYears 
 [ 8 10  7  6 12  1 17  5  3 31 13  0 26 24 22  9 19  2 23 14 15  4 29 28 21 25 20 11 16 37 38 30 40 18 36 34 32 33 35 27] 
 --------------------------------------------------------------------------------
 TrainingTimesLastYear 
 [0 3 2 5 1 4 6] 
 --------------------------------------------------------------------------------
 WorkLifeBalance 
 [1 3 2 4] 
 --------------------------------------------------------------------------------
 YearsAtCompany 
 [ 6 10  0  8  2  7  1  9  5  4 25  3 12 14 22 15 27 21 17 11 13 37 16 20 40 24 33 19 36 18 29 31 32 34 26 30 23] 
 --------------------------------------------------------------------------------
 YearsInCurrentRole 
 [ 4  7  0  2  5  9  8  3  6 13  1 15 14 16 11 10 12 18 17] 
 --------------------------------------------------------------------------------
 YearsSinceLastPromotion 
 [ 0  1  3  2  7  4  8  6  5 15  9 13 12 10 11 14] 
 --------------------------------------------------------------------------------
 YearsWithCurrManager 
 [ 5  7  0  2  6  8  3 11 17  1  4 12  9 10 15 13 16 14] 
 --------------------------------------------------------------------------------
# prints out names of columns
print(df.columns)
Index(['Age', 'Attrition', 'BusinessTravel', 'Department', 'DistanceFromHome',
'Education', 'EducationField', 'EnvironmentSatisfaction', 'Gender',
'JobInvolvement', 'JobRole', 'JobSatisfaction', 'MaritalStatus',
'MonthlyIncome', 'NumCompaniesWorked', 'OverTime', 'PercentSalaryHike',
'RelationshipSatisfaction', 'TotalWorkingYears',
'TrainingTimesLastYear', 'WorkLifeBalance', 'YearsAtCompany',
'YearsInCurrentRole', 'YearsSinceLastPromotion',
'YearsWithCurrManager'],
dtype='object')
# checking the df shape
print('The df has {} rows and {} columns.'.format(*df.shape))
The df has 1470 rows and 25 columns.
# prints out the first row for the dataset
print(df.head(1).transpose())
                                        0
Age                                    41
Attrition                             Yes
BusinessTravel              Travel_Rarely
Department                          Sales
DistanceFromHome                        1
Education                         College
EducationField              Life Sciences
EnvironmentSatisfaction                 2
Gender                             Female
JobInvolvement                          3
JobRole                   Sales Executive
JobSatisfaction                         4
MaritalStatus                      Single
MonthlyIncome                        5993
NumCompaniesWorked                      8
OverTime                              Yes
PercentSalaryHike                      11
RelationshipSatisfaction                1
TotalWorkingYears                       8
TrainingTimesLastYear                   0
WorkLifeBalance                         1
YearsAtCompany                          6
YearsInCurrentRole                      4
YearsSinceLastPromotion                 0
YearsWithCurrManager                    5
# checking counts and giving the mean, SD and quartiles for all numeric columns
print(df.describe().transpose())
count mean std min 25% \
Age 1470.0 36.923810 9.135373 18.0 30.0
DistanceFromHome 1470.0 9.192517 8.106864 1.0 2.0
EnvironmentSatisfaction 1470.0 2.721769 1.093082 1.0 2.0
JobInvolvement 1470.0 2.729932 0.711561 1.0 2.0
JobSatisfaction 1470.0 2.728571 1.102846 1.0 2.0
MonthlyIncome 1470.0 6502.931293 4707.956783 1009.0 2911.0
NumCompaniesWorked 1470.0 2.693197 2.498009 0.0 1.0
PercentSalaryHike 1470.0 15.209524 3.659938 11.0 12.0
RelationshipSatisfaction 1470.0 2.712245 1.081209 1.0 2.0
TotalWorkingYears 1470.0 11.279592 7.780782 0.0 6.0
TrainingTimesLastYear 1470.0 2.799320 1.289271 0.0 2.0
WorkLifeBalance 1470.0 2.761224 0.706476 1.0 2.0
YearsAtCompany 1470.0 7.008163 6.126525 0.0 3.0
YearsInCurrentRole 1470.0 4.229252 3.623137 0.0 2.0
YearsSinceLastPromotion 1470.0 2.187755 3.222430 0.0 0.0
YearsWithCurrManager 1470.0 4.123129 3.568136 0.0 2.0
50% 75% max
Age 36.0 43.0 60.0
DistanceFromHome 7.0 14.0 29.0
EnvironmentSatisfaction 3.0 4.0 4.0
JobInvolvement 3.0 3.0 4.0
JobSatisfaction 3.0 4.0 4.0
MonthlyIncome 4919.0 8379.0 19999.0
NumCompaniesWorked 2.0 4.0 9.0
PercentSalaryHike 14.0 18.0 25.0
RelationshipSatisfaction 3.0 4.0 4.0
TotalWorkingYears 10.0 15.0 40.0
TrainingTimesLastYear 3.0 3.0 6.0
WorkLifeBalance 3.0 3.0 4.0
YearsAtCompany 5.0 9.0 40.0
YearsInCurrentRole 3.0 7.0 18.0
YearsSinceLastPromotion 1.0 3.0 15.0
YearsWithCurrManager 3.0 7.0 17.0
The following breakdown of the 'Attrition' column was observed. It shows that, in the dataset, the number of employees who stayed with the company exceeds the number who left.
| Attrition | # of Employees |
|---|---|
| Stayed (No) | 1233 |
| Left (Yes) | 237 |
The dataset is imbalanced because the two classes (stayed or left) do not contain the same number of employees. This can make modelling difficult and inaccurate, and may require upsampling or downsampling.
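One common remedy is to resample the training data. Below is a minimal sketch of upsampling the minority class with pandas on a hypothetical toy frame (the project might equally use downsampling or a dedicated library such as imbalanced-learn):

```python
import pandas as pd

# Toy imbalanced frame; 'Attrition' stands in for the target column
df = pd.DataFrame({'Attrition': ['Stayed'] * 8 + ['Left'] * 2,
                   'Age': list(range(10))})

majority = df[df['Attrition'] == 'Stayed']
minority = df[df['Attrition'] == 'Left']

# Sample the minority class with replacement until it matches the majority
minority_up = minority.sample(n=len(majority), replace=True, random_state=42)
balanced = pd.concat([majority, minority_up], ignore_index=True)
print(balanced['Attrition'].value_counts())
```

Note that resampling should only ever be applied to the training split, never to the evaluation data.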
# changing the value in column 'Attrition' for user's ease
df['Attrition'] = df['Attrition'].str.replace('No', 'Stayed')
df['Attrition'] = df['Attrition'].str.replace('Yes', 'Left')
# Attrition
df.groupby(['Attrition']).size().sort_values(ascending=False)
Attrition
Stayed    1233
Left       237
dtype: int64
Plotly's interactive graphs were used to visualise categorical variables, and Seaborn was used to create pairplots (which examine correlations and distributions) and heatmaps.
A Tableau dashboard was also used to visualise the dataset and build a story with it; the dashboard is embedded in this Jupyter Notebook.
# ATTRITION
attrition = df.groupby(['Attrition']).size().sort_values(ascending=False)
attrition = pd.DataFrame(attrition)
attrition = attrition.reset_index(drop=False)
attrition = attrition.rename(columns={0 : "size"})
fig = px.bar(x = attrition["Attrition"], y = attrition["size"],
color = attrition["Attrition"])
fig.update_layout(title = "Attrition Rate",
yaxis_title = 'Number of Employees',
xaxis_title = None,
legend_title_text="Attrition")
fig.update_layout(height=600, width=600)
fig.show()
Insight
The leavers are treated as the minority class in this project due to their small number of observations, while the stayers are treated as the majority class due to their much higher number of observations.
# EDUCATION
Education = df.groupby(['Education', 'Attrition']).size().sort_values(ascending=False)
Education = Education.reset_index(drop=False)
Education = Education.rename(columns={0 : "size"})
fig = px.bar(x = Education["Education"],
y = Education["size"],
color = Education["Attrition"])
for idx in range(len(fig.data)):
    fig.data[idx].x = ['Below College','College','Bachelor', 'Master', 'Doctor']
fig.update_layout(title = "Breakdown of Education",
yaxis_title = 'Number of Employees',
xaxis_title = None,
legend_title_text="Attrition",
barmode='group')
fig.update_layout(height=600, width=600)
fig.show()
Insight
The graph mostly shows the distribution of employees by category, rather than giving deeper insight into attrition.
# EDUCATIONFIELD
EducationField = df.groupby(['EducationField', 'Attrition']).size().sort_values(ascending=False)
EducationField = EducationField.reset_index(drop=False)
EducationField = EducationField.rename(columns={0 : "size"})
fig = px.bar(x = EducationField["EducationField"],
y = EducationField["size"],
color = EducationField["Attrition"])
fig.update_layout(title = "Breakdown of Education Field",
yaxis_title = 'Number of Employees',
xaxis_title = None,
legend_title_text="Attrition",
barmode='group')
fig.update_layout(height=600, width=600)
fig.show()
Insight
The graph mostly shows the distribution of employees by category, rather than giving deeper insight into attrition.
# BUSINESS TRAVEL
BusinessTravel = df.groupby(['BusinessTravel', 'Attrition']).size().sort_values(ascending=False)
BusinessTravel = pd.DataFrame(BusinessTravel)
BusinessTravel = BusinessTravel.reset_index(drop=False)
BusinessTravel = BusinessTravel.rename(columns={0 : "size"})
fig = px.bar(x = BusinessTravel["BusinessTravel"],
y = BusinessTravel["size"],
color = BusinessTravel["Attrition"])
fig.update_layout(title = "Business Travel",
yaxis_title = 'Number of Employees',
xaxis_title = None,
legend_title_text="Attrition",
barmode='group')
fig.update_layout(height=600, width=600)
fig.show()
Insight
It seems that employees who are not required to travel for business leave the company the least.
# DEPARTMENT
Department = df.groupby(['Department', 'Attrition']).size().sort_values(ascending=False)
Department = pd.DataFrame(Department)
Department = Department.reset_index(drop=False)
Department = Department.rename(columns={0 : "size"})
fig = px.bar(x = Department["Department"],
y = Department["size"],
color = Department["Attrition"])
fig.update_layout(title = "Breakdown by Department",
yaxis_title = 'Number of Employees',
xaxis_title = None,
legend_title_text="Attrition",
barmode='group')
fig.update_layout(height=600, width=600)
fig.show()
Insight
The graph mostly shows the distribution of employees by category, rather than giving deeper insight into attrition.
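Since raw counts largely mirror headcount, within-category attrition rates are often more informative than the bar heights above. A sketch using `pd.crosstab` on a toy frame standing in for the dataset:

```python
import pandas as pd

# Toy frame standing in for the dataset
df = pd.DataFrame({
    'Department': ['Sales', 'Sales', 'Sales', 'R&D', 'R&D'],
    'Attrition': ['Left', 'Stayed', 'Stayed', 'Stayed', 'Stayed'],
})

# normalize='index' converts counts into within-department proportions
rates = pd.crosstab(df['Department'], df['Attrition'], normalize='index')
print(rates)
```

The same call with the notebook's own `df` would show which departments lose the largest share of their staff, regardless of department size.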
# GENDER
gender = df.groupby(['Gender', 'Attrition']).size().sort_values(ascending=False)
gender = pd.DataFrame(gender)
gender = gender.reset_index(drop=False)
gender = gender.rename(columns={0 : "size"})
fig = px.bar(x = gender["Gender"],
y = gender["size"],
color = gender["Attrition"])
fig.update_layout(title = "Breakdown of Genders",
yaxis_title = 'Number of Employees',
xaxis_title = None,
legend_title_text="Attrition",
barmode='group')
fig.update_layout(height=600, width=600)
fig.show()
Insight
More men are employed in the company than women, and the proportions of men and women leaving appear similar. Beyond that, the graph mostly shows the distribution of employees by category rather than giving deeper insight into attrition.
# OVERTIME
OverTime = df.groupby(['OverTime', 'Attrition']).size().sort_values(ascending=False)
OverTime = OverTime.reset_index(drop=False)
OverTime = OverTime.rename(columns={0 : "size"})
fig = px.bar(x = OverTime["OverTime"],
y = OverTime["size"],
color = OverTime["Attrition"])
fig.update_layout(title = "Breakdown of Over Time",
yaxis_title = 'Number of Employees',
xaxis_title = None,
legend_title_text="Attrition",
barmode='group')
fig.update_layout(height=600, width=600)
fig.show()
Insight
There is not a big difference between stayers and leavers with respect to working overtime, which suggests that overtime is not the only factor that makes employees leave.
# JOBROLE
jobrole = df.groupby(['JobRole', 'Attrition']).size().sort_values(ascending=False)
jobrole = pd.DataFrame(jobrole)
jobrole = jobrole.reset_index(drop=False)
jobrole = jobrole.rename(columns={0 : "size"})
fig = px.bar(x = jobrole["JobRole"],
y = jobrole["size"],
color = jobrole["Attrition"])
fig.update_layout(title = "Breakdown of Job Role",
yaxis_title = 'Number of Employees',
xaxis_title = None,
legend_title_text="Attrition",
barmode='group')
fig.update_layout(height=600, width=600)
fig.show()
Insight
Several roles have higher employee turnover than others. For example, nearly half of all Sales Representatives have left their position. Sales Executives, Research Scientists and Laboratory Technicians also show higher turnover than the remaining positions.
# MARITALSTATUS
maritalstatus = df.groupby(['MaritalStatus', 'Attrition']).size().sort_values(ascending=False)
maritalstatus = pd.DataFrame(maritalstatus)
maritalstatus = maritalstatus.reset_index(drop=False)
maritalstatus = maritalstatus.rename(columns={0 : "size"})
fig = px.bar(maritalstatus, x="MaritalStatus", y="size",
             color="Attrition",
             # passing the DataFrame lets category_orders match the column name
             category_orders={"MaritalStatus": ['Single','Married','Divorced']})
fig.update_layout(title = "Breakdown of Marital Status",
legend_title_text="Attrition",
yaxis_title = 'Number of Employees',
xaxis_title = None,
barmode='group')
fig.update_layout(height=600, width=600)
fig.show()
Insight
There seem to be more leavers who are single than leavers who are married or divorced. This is an interesting observation, which will have to be explored further.
Perhaps single people are aged between 20 and 35 and on low wages, and are leaving to obtain a higher income. This hypothesis warrants further exploration.
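This hypothesis could be probed with a simple group summary. A sketch on a toy stand-in frame (the notebook's own `df` would be used in practice):

```python
import pandas as pd

# Toy stand-in frame; values are illustrative only
df = pd.DataFrame({
    'MaritalStatus': ['Single', 'Single', 'Married', 'Married'],
    'Attrition': ['Left', 'Stayed', 'Stayed', 'Stayed'],
    'Age': [25, 30, 45, 50],
    'MonthlyIncome': [2000, 3000, 8000, 9000],
})

# Median age and income per marital status and attrition group
summary = df.groupby(['MaritalStatus', 'Attrition'])[['Age', 'MonthlyIncome']].median()
print(summary)
```

If single leavers show a noticeably lower median age and income than the other groups, that would support the hypothesis above.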
# WORKLIFEBALANCE
worklifebalance = df.groupby(['WorkLifeBalance', 'Attrition']).size().sort_values(ascending=False)
worklifebalance = worklifebalance.reset_index(drop=False)
worklifebalance = worklifebalance.rename(columns={0 : "size"})
fig = px.bar(x = worklifebalance["WorkLifeBalance"],
y = worklifebalance["size"],
color = worklifebalance["Attrition"])
for idx in range(len(fig.data)):
    fig.data[idx].x = ['Bad','Good','Better', 'Best']
fig.update_layout(title = "Breakdown of Work Life Balance",
yaxis_title = 'Number of Employees',
xaxis_title = None,
legend_title_text="Attrition",
barmode='group')
fig.update_layout(height=600, width=600)
fig.show()
C:\Users\Karina\miniconda3\lib\site-packages\numpy\core\numeric.py:2446: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
Insight
This graph is interesting, because 766 stayers state that they have a bad work-life balance, while only 127 actually left the company.
Fifty-five stayers voted that they have the 'best' work-life balance, while 25 leavers voted the same, nearly half as many as the stayers in that category. This is interesting because it slightly suggests that having the best work-life balance does not stop employees from leaving.
It was decided to reconstruct the categories in WorkLifeBalance by merging 'Better' and 'Best' into 'Good'. Semantically, there is not much of a difference between the three categories, and merging them helps to condense the variable.
# # Changing the numbers to appropriate category
# df['WorkLifeBalance'] = df['WorkLifeBalance'].replace('Better', 'Good')
# df['WorkLifeBalance'] = df['WorkLifeBalance'].replace('Best', 'Good')
# # converting variable into a categorical variable
# df['WorkLifeBalance'] = df['WorkLifeBalance'].astype('category')
# # #reordering - important in graphs
# df['WorkLifeBalance'] = df['WorkLifeBalance'].cat.reorder_categories(['Bad', 'Good'], ordered=True)
# WORKLIFEBALANCE
worklifebalance = df.groupby(['WorkLifeBalance', 'Attrition']).size().sort_values(ascending=False)
worklifebalance = worklifebalance.reset_index(drop=False)
worklifebalance = worklifebalance.rename(columns={0 : "size"})
fig = px.bar(x = worklifebalance["WorkLifeBalance"],
y = worklifebalance["size"],
color = worklifebalance["Attrition"])
for idx in range(len(fig.data)):
    fig.data[idx].x = ['Bad','Good']
fig.update_layout(title = "Breakdown of Work Life Balance After Reconstruction of Category",
yaxis_title = 'Number of Employees',
xaxis_title = None,
legend_title_text="Attrition",
barmode='group')
fig.update_layout(height=600, width=600)
fig.show()
# JobSatisfaction
JobSatisfaction = df.groupby(['JobSatisfaction', 'Attrition']).size().sort_values(ascending=False)
JobSatisfaction = JobSatisfaction.reset_index(drop=False)
JobSatisfaction = JobSatisfaction.rename(columns={0 : "size"})
fig = px.bar(x = JobSatisfaction["JobSatisfaction"],
y = JobSatisfaction["size"],
color = JobSatisfaction["Attrition"])
for idx in range(len(fig.data)):
    fig.data[idx].x = ['Low','Medium','High', 'Very High']
fig.update_layout(title = "Breakdown of Job Satisfaction",
yaxis_title = 'Number of Employees',
xaxis_title = None,
legend_title_text="Attrition",
barmode='group')
fig.update_layout(height=600, width=600)
fig.show()
Insight
The number of leavers seems to be consistent throughout the categories. Notably, many employees who reported high job satisfaction still left the job.
# JobInvolvement
JobInvolvement = df.groupby(['JobInvolvement', 'Attrition']).size().sort_values(ascending=False)
JobInvolvement = JobInvolvement.reset_index(drop=False)
JobInvolvement = JobInvolvement.rename(columns={0 : "size"})
fig = px.bar(x = JobInvolvement["JobInvolvement"],
y = JobInvolvement["size"],
color = JobInvolvement["Attrition"])
for idx in range(len(fig.data)):
    fig.data[idx].x = ['Low','Medium','High', 'Very High']
fig.update_layout(title = "Breakdown of Job Involvement",
yaxis_title = 'Number of Employees',
xaxis_title = None,
legend_title_text="Attrition",
barmode='group')
fig.update_layout(height=600, width=600)
fig.show()
# EnvironmentSatisfaction
EnvironmentSatisfaction = df.groupby(['EnvironmentSatisfaction', 'Attrition']).size().sort_values(ascending=False)
EnvironmentSatisfaction = EnvironmentSatisfaction.reset_index(drop=False)
EnvironmentSatisfaction = EnvironmentSatisfaction.rename(columns={0 : "size"})
fig = px.bar(x = EnvironmentSatisfaction["EnvironmentSatisfaction"],
y = EnvironmentSatisfaction["size"],
color = EnvironmentSatisfaction["Attrition"])
for idx in range(len(fig.data)):
    fig.data[idx].x = ['Low','Medium','High', 'Very High']
fig.update_layout(title = "Breakdown of Environment Satisfaction",
yaxis_title = 'Number of Employees',
xaxis_title = None,
legend_title_text="Attrition",
barmode='group')
fig.update_layout(height=600, width=600)
fig.show()
/usr/local/lib/python3.7/dist-packages/numpy/core/numeric.py:2446: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
Correlations between pairs of features were examined by producing a tabulated correlation matrix of the numerical variables, followed by a heatmap for easier visualisation.
# taking only numeric df
numeric = df[['Age',
'DistanceFromHome',
'MonthlyIncome',
'NumCompaniesWorked',
'PercentSalaryHike',
'TotalWorkingYears',
'TrainingTimesLastYear',
'WorkLifeBalance',
'YearsAtCompany',
'YearsSinceLastPromotion',
'YearsWithCurrManager',
'YearsInCurrentRole',
'EnvironmentSatisfaction',
'JobInvolvement',
'JobSatisfaction'
]]
# Check for high correlations between features
z = numeric.corr()
# Creating a heatmap with Seaborn
# easier to identify correlation than tabulated figures
sns.set(rc={'figure.figsize':(16, 12)})
ax = plt.axes()
ax.set_title("Employee Attrition Variable Correlation Heatmap", fontsize = 16)
sns.heatmap(z, cmap="seismic", annot=True, vmin=-1, vmax=1)
<matplotlib.axes._subplots.AxesSubplot at 0x7f46fd3ba7d0>
There were notable correlations between YearsAtCompany and YearsInCurrentRole (r = 0.76), YearsAtCompany and YearsSinceLastPromotion (r = 0.62), YearsAtCompany and YearsWithCurrManager (r = 0.77), YearsInCurrentRole and YearsSinceLastPromotion (r = 0.55), and YearsInCurrentRole and YearsWithCurrManager (r = 0.71).
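Such pairs can also be extracted programmatically from the correlation matrix instead of being read off the heatmap. A sketch on a toy frame (the 0.5 threshold is an assumption for illustration; the notebook's `numeric` frame would be used in practice):

```python
import numpy as np
import pandas as pd

# Toy numeric frame with one strongly correlated pair
numeric = pd.DataFrame({'a': [1, 2, 3, 4],
                        'b': [2, 4, 6, 8],   # perfectly correlated with 'a'
                        'c': [4, 1, 3, 2]})

corr = numeric.corr()
# Mask everything but the strict upper triangle so each pair appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
high_pairs = pairs[pairs.abs() > 0.5]
print(high_pairs)
```

This yields an explicit list of (feature, feature) pairs above the chosen threshold, which is handy when deciding which columns to drop.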
A pairplot was used to visualise correlations through scatterplots and examine distributions through histograms.
# creating a pairplot examining the scatterplots and disribution of all numeric variables
sns.set(style = 'ticks', color_codes=True)
sns.pairplot(data=df, hue='Attrition',
vars=['Age',
'DistanceFromHome',
'MonthlyIncome',
'NumCompaniesWorked',
'PercentSalaryHike',
'TotalWorkingYears',
'TrainingTimesLastYear',
'YearsAtCompany',
'YearsSinceLastPromotion',
'YearsWithCurrManager',
'YearsInCurrentRole',
])
<seaborn.axisgrid.PairGrid at 0x7f46fd3ba750>
Skewness and kurtosis explore whether the data are normally distributed. Skewness examines whether the distribution leans to the left or to the right, making it distorted or asymmetric, while kurtosis examines the peak of the distribution: whether it is too flat or too narrow.
Both are deviations from the normal distribution, whose shape resembles a bell, hence the name: bell curve.
George & Mallery (2010) consider measures between -2 and +2 to be acceptable; according to this criterion, all variables meet the threshold.
Hair et al. (2010) and Byrne (2010) consider measures of -2 to +2 for skewness and -7 to +7 for kurtosis to suggest normal distribution of the variables; all variables meet these criteria as well.
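These thresholds can also be applied programmatically. A sketch using the George & Mallery band of ±2 on a toy frame (the notebook applies the same `agg` call to its `numeric` frame):

```python
import pandas as pd

# Toy frame with one symmetric and one heavily skewed column
numeric = pd.DataFrame({'symmetric': [1, 2, 3, 4, 5],
                        'skewed': [1, 1, 1, 1, 100]})

stats = numeric.agg(['skew', 'kurtosis']).transpose()
# George & Mallery (2010): flag anything outside the -2..+2 band
flagged = stats[(stats['skew'].abs() > 2) | (stats['kurtosis'].abs() > 2)]
print(flagged.index.tolist())  # ['skewed']
```

Flagged variables would be candidates for a transformation (e.g. a log transform) before modelling techniques that assume normality.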
# review skewness and kurtosis
numeric.agg(['skew', 'kurtosis']).transpose()
| skew | kurtosis | |
|---|---|---|
| Age | 0.413286 | -0.404145 |
| DistanceFromHome | 0.958118 | -0.224833 |
| MonthlyIncome | 1.369817 | 1.005233 |
| NumCompaniesWorked | 1.026471 | 0.010214 |
| PercentSalaryHike | 0.821128 | -0.300598 |
| TotalWorkingYears | 1.117172 | 0.918270 |
| TrainingTimesLastYear | 0.553124 | 0.494993 |
| WorkLifeBalance | -0.552480 | 0.419460 |
| YearsAtCompany | 1.764529 | 3.935509 |
| YearsSinceLastPromotion | 1.984290 | 3.612673 |
| YearsWithCurrManager | 0.833451 | 0.171058 |
| YearsInCurrentRole | 0.917363 | 0.477421 |
| EnvironmentSatisfaction | -0.321654 | -1.202521 |
| JobInvolvement | -0.498419 | 0.270999 |
| JobSatisfaction | -0.329672 | -1.222193 |
Due to the high correlation between some variables, it was decided to run a variance inflation factor (VIF) analysis to assess multicollinearity. Multicollinearity can produce estimates of the regression coefficients that are not statistically significant: when two or more independent variables are highly correlated, it becomes difficult to state which variable is really influencing the dependent variable (Gil, Sousa and Verleysen, 2013).
# creating a constant for the VIF analysis
numeric = sm.add_constant(numeric)
# Run VIF for all variable in original dataframe
vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(numeric.values, i) for i in range(numeric.shape[1])]
vif["features"] = numeric.columns
vif.round(1)
/usr/local/lib/python3.7/dist-packages/statsmodels/tsa/tsatools.py:117: FutureWarning: In a future version of pandas all arguments of concat except for the argument 'objs' will be keyword-only
| VIF Factor | features | |
|---|---|---|
| 0 | 89.5 | const |
| 1 | 2.0 | Age |
| 2 | 1.0 | DistanceFromHome |
| 3 | 2.5 | MonthlyIncome |
| 4 | 1.2 | NumCompaniesWorked |
| 5 | 1.0 | PercentSalaryHike |
| 6 | 4.6 | TotalWorkingYears |
| 7 | 1.0 | TrainingTimesLastYear |
| 8 | 1.0 | WorkLifeBalance |
| 9 | 4.6 | YearsAtCompany |
| 10 | 1.7 | YearsSinceLastPromotion |
| 11 | 2.8 | YearsWithCurrManager |
| 12 | 2.7 | YearsInCurrentRole |
| 13 | 1.0 | EnvironmentSatisfaction |
| 14 | 1.0 | JobInvolvement |
| 15 | 1.0 | JobSatisfaction |
Based on the above exploration, 'YearsWithCurrManager' and 'YearsInCurrentRole' were dropped due to their high correlation with other variables.
VIF showed limited multicollinearity, which did not require the removal of further variables.
Skewness and kurtosis also indicated approximately normally distributed data, so no further treatment was needed to normalise it.
Correlations and VIF were rerun to confirm that the correlations and multicollinearity had been dealt with.
# taking only numeric df
numeric = df[['Age',
'DistanceFromHome',
'MonthlyIncome',
'NumCompaniesWorked',
'PercentSalaryHike',
'TotalWorkingYears',
'TrainingTimesLastYear',
'WorkLifeBalance',
'YearsAtCompany',
'YearsSinceLastPromotion',
# 'YearsWithCurrManager',
# 'YearsInCurrentRole',
'EnvironmentSatisfaction',
'JobInvolvement',
'JobSatisfaction'
]]
# Check for high correlations between features
z = numeric.corr()
# Creating a heatmap with Seaborn
# easier to identify correlation than tabulated figures
sns.set(rc={'figure.figsize':(16, 12)})
ax = plt.axes()
ax.set_title("Employee Attrition Variable Correlation Heatmap", fontsize = 16)
sns.heatmap(z, cmap="seismic", annot=True, vmin=-1, vmax=1)
<matplotlib.axes._subplots.AxesSubplot at 0x7f46f5b6be50>
# creating a constant for the VIF analysis
numeric = sm.add_constant(numeric)
# Run VIF for all variable in original dataframe
vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(numeric.values, i) for i in range(numeric.shape[1])]
vif["features"] = numeric.columns
vif.round(1)
/usr/local/lib/python3.7/dist-packages/statsmodels/tsa/tsatools.py:117: FutureWarning: In a future version of pandas all arguments of concat except for the argument 'objs' will be keyword-only
| VIF Factor | features | |
|---|---|---|
| 0 | 89.2 | const |
| 1 | 2.0 | Age |
| 2 | 1.0 | DistanceFromHome |
| 3 | 2.5 | MonthlyIncome |
| 4 | 1.2 | NumCompaniesWorked |
| 5 | 1.0 | PercentSalaryHike |
| 6 | 4.6 | TotalWorkingYears |
| 7 | 1.0 | TrainingTimesLastYear |
| 8 | 1.0 | WorkLifeBalance |
| 9 | 2.6 | YearsAtCompany |
| 10 | 1.6 | YearsSinceLastPromotion |
| 11 | 1.0 | EnvironmentSatisfaction |
| 12 | 1.0 | JobInvolvement |
| 13 | 1.0 | JobSatisfaction |
# dropping variables
df = df.drop(['YearsWithCurrManager',
'YearsInCurrentRole'], axis=1)
%%HTML
<div class='tableauPlaceholder' id='viz1646135795375' style='position: relative'><object class='tableauViz' style='display:none;'><param name='host_url' value='https%3A%2F%2Fpublic.tableau.com%2F' /> <param name='embed_code_version' value='3' /> <param name='site_root' value='' /><param name='name' value='Employee_Attrition_16431209161820/Dashboard1' /><param name='tabs' value='no' /><param name='toolbar' value='yes' /><param name='animate_transition' value='yes' /><param name='display_static_image' value='yes' /><param name='display_spinner' value='yes' /><param name='display_overlay' value='yes' /><param name='display_count' value='yes' /><param name='language' value='en-US' /></object></div> <script type='text/javascript'> var divElement = document.getElementById('viz1646135795375'); var vizElement = divElement.getElementsByTagName('object')[0]; if ( divElement.offsetWidth > 800 ) { vizElement.style.width='1000px';vizElement.style.height='827px';} else if ( divElement.offsetWidth > 500 ) { vizElement.style.width='1000px';vizElement.style.height='827px';} else { vizElement.style.width='100%';vizElement.style.height='2077px';} var scriptElement = document.createElement('script'); scriptElement.src = 'https://public.tableau.com/javascripts/api/viz_v1.js'; vizElement.parentNode.insertBefore(scriptElement, vizElement); </script>
After dropping irrelevant and highly correlated variables, 23 variables remained, with 1470 observations in the dataset.
# checking the df shape
print('The df has {} rows and {} columns.'.format(*df.shape))
The df has 1470 rows and 23 columns.
print(df.head(1).transpose())
                                        0
Age                                    41
Attrition                            Left
BusinessTravel              Travel_Rarely
Department                          Sales
DistanceFromHome                        1
Education                         College
EducationField              Life Sciences
EnvironmentSatisfaction                 2
Gender                             Female
JobInvolvement                          3
JobRole                   Sales Executive
JobSatisfaction                         4
MaritalStatus                      Single
MonthlyIncome                        5993
NumCompaniesWorked                      8
OverTime                              Yes
PercentSalaryHike                      11
RelationshipSatisfaction                1
TotalWorkingYears                       8
TrainingTimesLastYear                   0
WorkLifeBalance                         1
YearsAtCompany                          6
YearsSinceLastPromotion                 0
Variables with categorical values, such as 'MaritalStatus' with its values 'Single', 'Married' and 'Divorced', were split into one column per category represented in the data and converted to integers.
If an employee was married, a value of 1 was assigned to MaritalStatus_Married and 0 to the other two MaritalStatus columns (MaritalStatus_Single, MaritalStatus_Divorced). This increased the overall column count to 46, as the final_data summary shows.
df['OverTime'] = df['OverTime'].astype('str')
# NO is 0 because 'N' is before 'Y' alphabetically
df['OverTime'] = df['OverTime'].str.replace('No', '0')
df['OverTime'] = df['OverTime'].str.replace('Yes', '1')
df['OverTime'] = df['OverTime'].astype('int')
df['Attrition'] = df['Attrition'].astype('str')
# NO (changed later to 'Left') is 0 because 'N' is before 'Y'(changed later to 'Stayed') alphabetically
df['Attrition'] = df['Attrition'].str.replace('Stayed', '0')
df['Attrition'] = df['Attrition'].str.replace('Left', '1')
df['Attrition'] = df['Attrition'].astype('int')
df['Gender'] = df['Gender'].astype('str')
# Female is 0 because 'F' is before 'M' alphabetically
df['Gender'] = df['Gender'].str.replace('Female', '0')
df['Gender'] = df['Gender'].str.replace('Male', '1')
df['Gender'] = df['Gender'].astype('int')
# creating dummies for the dataset
final_data = pd.get_dummies(df)
print(final_data.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 46 columns):
 #   Column                             Non-Null Count  Dtype
---  ------                             --------------  -----
 0   Age                                1470 non-null   int64
 1   Attrition                          1470 non-null   int64
 2   DistanceFromHome                   1470 non-null   int64
 3   EnvironmentSatisfaction            1470 non-null   int64
 4   Gender                             1470 non-null   int64
 5   JobInvolvement                     1470 non-null   int64
 6   JobSatisfaction                    1470 non-null   int64
 7   MonthlyIncome                      1470 non-null   int64
 8   NumCompaniesWorked                 1470 non-null   int64
 9   OverTime                           1470 non-null   int64
 10  PercentSalaryHike                  1470 non-null   int64
 11  RelationshipSatisfaction           1470 non-null   int64
 12  TotalWorkingYears                  1470 non-null   int64
 13  TrainingTimesLastYear              1470 non-null   int64
 14  WorkLifeBalance                    1470 non-null   int64
 15  YearsAtCompany                     1470 non-null   int64
 16  YearsSinceLastPromotion            1470 non-null   int64
 17  BusinessTravel_Non-Travel          1470 non-null   uint8
 18  BusinessTravel_Travel_Frequently   1470 non-null   uint8
 19  BusinessTravel_Travel_Rarely       1470 non-null   uint8
 20  Department_Human Resources         1470 non-null   uint8
 21  Department_Research & Development  1470 non-null   uint8
 22  Department_Sales                   1470 non-null   uint8
 23  Education_Below College            1470 non-null   uint8
 24  Education_College                  1470 non-null   uint8
 25  Education_Bachelor                 1470 non-null   uint8
 26  Education_Master                   1470 non-null   uint8
 27  Education_Doctor                   1470 non-null   uint8
 28  EducationField_Human Resources     1470 non-null   uint8
 29  EducationField_Life Sciences       1470 non-null   uint8
 30  EducationField_Marketing           1470 non-null   uint8
 31  EducationField_Medical             1470 non-null   uint8
 32  EducationField_Other               1470 non-null   uint8
 33  EducationField_Technical Degree    1470 non-null   uint8
 34  JobRole_Healthcare Representative  1470 non-null   uint8
 35  JobRole_Human Resources            1470 non-null   uint8
 36  JobRole_Laboratory Technician      1470 non-null   uint8
 37  JobRole_Manager                    1470 non-null   uint8
 38  JobRole_Manufacturing Director     1470 non-null   uint8
 39  JobRole_Research Director          1470 non-null   uint8
 40  JobRole_Research Scientist         1470 non-null   uint8
 41  JobRole_Sales Executive            1470 non-null   uint8
 42  JobRole_Sales Representative       1470 non-null   uint8
 43  MaritalStatus_Single               1470 non-null   uint8
 44  MaritalStatus_Married              1470 non-null   uint8
 45  MaritalStatus_Divorced             1470 non-null   uint8
dtypes: int64(17), uint8(29)
memory usage: 237.0 KB
None
| Feature | Description | Value |
|---|---|---|
| BusinessTravel_Travel_Frequently | Does not travel frequently | 0 |
| BusinessTravel_Travel_Rarely | Does not travel rarely | 0 |
| BusinessTravel_Non-Travel | Does not travel for work | 1 |
With dummy variables it is possible to drop one column: if 'BusinessTravel_Travel_Frequently' and 'BusinessTravel_Travel_Rarely' are both 0, then 'BusinessTravel_Non-Travel' must be 1. The category left out is called the reference category; all interpretation is then made relative to that category. Dropping it also avoids perfect multicollinearity among the dummy variables.
In the end there are 2 variables instead of 3, which reduces the number of columns.
| Feature | Description | Value |
|---|---|---|
| BusinessTravel_Travel_Frequently | Does not travel frequently | 0 |
| BusinessTravel_Travel_Rarely | Does not travel rarely | 0 |
The same was done for the 'Education', 'Department', 'JobRole' and 'MaritalStatus' variables.
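The reference-category encoding can be sketched with pandas' built-in `drop_first` option, which drops one dummy per categorical variable automatically (a toy example with made-up rows, not the project data):

```python
import pandas as pd

# toy column with the three marital-status categories
toy = pd.DataFrame({'MaritalStatus': ['Single', 'Married', 'Divorced', 'Married']})

# drop_first=True removes the first (alphabetically: 'Divorced') category,
# which becomes the reference category; dtype=int keeps 0/1 integers
dummies = pd.get_dummies(toy, drop_first=True, dtype=int)
print(dummies.columns.tolist())
# ['MaritalStatus_Married', 'MaritalStatus_Single']

# the 'Divorced' row is all zeros: both remaining dummies are 0,
# so the reference category is implied
print(dummies.iloc[2].tolist())  # [0, 0]
```

In this project the reference categories were instead chosen manually with `drop`, which allows a more meaningful baseline than the alphabetical default.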
final_data = final_data.drop([#'BusinessTravel_Non-Travel',
'BusinessTravel_Travel_Frequently',
# 'BusinessTravel_Travel_Rarely',
# 'Department_Human Resources',
# 'Department_Research & Development',
'Department_Sales',
# 'Education_Below College',
# 'Education_College',
# 'Education_Bachelor',
# 'Education_Master',
'Education_Doctor',
# 'EducationField_Human Resources',
# 'EducationField_Life Sciences',
# 'EducationField_Marketing',
# 'EducationField_Medical',
'EducationField_Other',
# 'EducationField_Technical Degree',
# 'JobRole_Healthcare Representative',
# 'JobRole_Human Resources',
# 'JobRole_Laboratory Technician',
# 'JobRole_Manager',
# 'JobRole_Manufacturing Director',
# 'JobRole_Research Director',
# 'JobRole_Research Scientist',
# 'JobRole_Sales Executive',
'JobRole_Sales Representative',
'MaritalStatus_Single',
# 'MaritalStatus_Married',
# 'MaritalStatus_Divorced'
], axis=1)
print('Number of variables left:',len(final_data.columns))
Number of variables left: 40
for i in final_data.columns:
print(i)
Age
Attrition
DistanceFromHome
EnvironmentSatisfaction
Gender
JobInvolvement
JobSatisfaction
MonthlyIncome
NumCompaniesWorked
OverTime
PercentSalaryHike
RelationshipSatisfaction
TotalWorkingYears
TrainingTimesLastYear
WorkLifeBalance
YearsAtCompany
YearsSinceLastPromotion
BusinessTravel_Non-Travel
BusinessTravel_Travel_Rarely
Department_Human Resources
Department_Research & Development
Education_Below College
Education_College
Education_Bachelor
Education_Master
EducationField_Human Resources
EducationField_Life Sciences
EducationField_Marketing
EducationField_Medical
EducationField_Technical Degree
JobRole_Healthcare Representative
JobRole_Human Resources
JobRole_Laboratory Technician
JobRole_Manager
JobRole_Manufacturing Director
JobRole_Research Director
JobRole_Research Scientist
JobRole_Sales Executive
MaritalStatus_Married
MaritalStatus_Divorced
# renaming colum types
final_data= final_data.rename(columns={"Department_Human Resources" : "Department_Human_Resources",
"Department_Research & Development" : "Department_Research_&_Development",
"Education_Below College" : "Education_Below_College",
"EducationField_Human Resources" : "EducationField_Human_Resources",
"EducationField_Life Sciences" : "EducationField_Life_Sciences",
"EducationField_Technical Degree" : "EducationField_Technical_Degree",
"JobRole_Healthcare Representative" : "JobRole_Healthcare_Representative",
"JobRole_Human Resources" : "JobRole_Human_Resources",
"JobRole_Laboratory Technician" : "JobRole_Laboratory_Technician",
"JobRole_Manufacturing Director" : "JobRole_Manufacturing_Director",
"JobRole_Research Director" : "JobRole_Research_Director",
"JobRole_Research Scientist" : "JobRole_Research_Scientist",
"JobRole_Sales Executive" : "JobRole_Sales_Executive"
})
print(final_data.columns)
Index(['Age', 'Attrition', 'DistanceFromHome', 'EnvironmentSatisfaction',
'Gender', 'JobInvolvement', 'JobSatisfaction', 'MonthlyIncome',
'NumCompaniesWorked', 'OverTime', 'PercentSalaryHike',
'RelationshipSatisfaction', 'TotalWorkingYears',
'TrainingTimesLastYear', 'WorkLifeBalance', 'YearsAtCompany',
'YearsSinceLastPromotion', 'BusinessTravel_Non-Travel',
'BusinessTravel_Travel_Rarely', 'Department_Human_Resources',
'Department_Research_&_Development', 'Education_Below_College',
'Education_College', 'Education_Bachelor', 'Education_Master',
'EducationField_Human_Resources', 'EducationField_Life_Sciences',
'EducationField_Marketing', 'EducationField_Medical',
'EducationField_Technical_Degree', 'JobRole_Healthcare_Representative',
'JobRole_Human_Resources', 'JobRole_Laboratory_Technician',
'JobRole_Manager', 'JobRole_Manufacturing_Director',
'JobRole_Research_Director', 'JobRole_Research_Scientist',
'JobRole_Sales_Executive', 'MaritalStatus_Married',
'MaritalStatus_Divorced'],
dtype='object')
print(final_data.shape)
(1470, 40)
| Attribute | DataType | Description |
|---|---|---|
| Age | int | Age of an employee: range 18 to 60 |
| DistanceFromHome | int | Number of miles away from home: range 1 to 29 |
| MonthlyIncome | int | Monthly Income of the Employee: Ranges from 1009 to 19999 |
| NumCompaniesWorked | int | Number of companies the employee had worked for: Ranges from 0 to 9 |
| PercentSalaryHike | int | Percentage of Salary Hike |
| TotalWorkingYears | int | Total Working Years |
| TrainingTimesLastYear | int | The number of training times last year: Ranges from 0 to 6 |
| YearsAtCompany | int | Number of years at the current company |
| YearsSinceLastPromotion | int | Number of years since last promotion: Ranges from 0 to 15 |
| WorkLifeBalance | int | Bad = 1, Good = 2, Better = 3, Best = 4 |
| EnvironmentSatisfaction | int | Low = 1, Medium = 2, High = 3, Very High = 4 |
| JobInvolvement | int | Low = 1, Medium = 2, High = 3, Very High = 4 |
| JobSatisfaction | int | Low = 1, Medium = 2, High = 3, Very High = 4 |
| Gender | Binomial | Female = 0, Male = 1 |
| OverTime | Binomial | No = 0, Yes = 1 |
| RelationshipSatisfaction | int | Low = 1, Medium = 2, High = 3, Very High = 4 |
| BusinessTravel_Non-Travel | Binomial | No = 0, Yes = 1 |
| BusinessTravel_Travel_Frequently | Binomial | No = 0, Yes = 1 |
| BusinessTravel_Travel_Rarely | Binomial | No = 0, Yes = 1 |
| Department_Human Resources | Binomial | No = 0, Yes = 1 |
| Department_Research & Development | Binomial | No = 0, Yes = 1 |
| Department_Sales | Binomial | No = 0, Yes = 1 |
| Education_Below College | Binomial | No = 0, Yes = 1 |
| Education_College | Binomial | No = 0, Yes = 1 |
| Education_Bachelor | Binomial | No = 0, Yes = 1 |
| Education_Master | Binomial | No = 0, Yes = 1 |
| Education_Doctor | Binomial | No = 0, Yes = 1 |
| EducationField_Human Resources | Binomial | No = 0, Yes = 1 |
| EducationField_Life Sciences | Binomial | No = 0, Yes = 1 |
| EducationField_Marketing | Binomial | No = 0, Yes = 1 |
| EducationField_Medical | Binomial | No = 0, Yes = 1 |
| EducationField_Other | Binomial | No = 0, Yes = 1 |
| EducationField_Technical Degree | Binomial | No = 0, Yes = 1 |
| JobRole_Healthcare Representative | Binomial | No = 0, Yes = 1 |
| JobRole_Human Resources | Binomial | No = 0, Yes = 1 |
| JobRole_Laboratory Technician | Binomial | No = 0, Yes = 1 |
| JobRole_Manager | Binomial | No = 0, Yes = 1 |
| JobRole_Manufacturing Director | Binomial | No = 0, Yes = 1 |
| JobRole_Research Director | Binomial | No = 0, Yes = 1 |
| JobRole_Research Scientist | Binomial | No = 0, Yes = 1 |
| JobRole_Sales Executive | Binomial | No = 0, Yes = 1 |
| JobRole_Sales Representative | Binomial | No = 0, Yes = 1 |
| MaritalStatus_Single | Binomial | No = 0, Yes = 1 |
| MaritalStatus_Married | Binomial | No = 0, Yes = 1 |
| MaritalStatus_Divorced | Binomial | No = 0, Yes = 1 |
Please note: DistanceFromHome was assumed to be in miles; the unit was not provided in the data dictionary.
| Attribute | DataType | Description |
|---|---|---|
| Attrition | Binomial | Did the employee leave or not: Yes or No? |
## Dividing dataset into label and feature sets
X = final_data.drop('Attrition', axis = 1)
# axis = 1 means we are dropping a column (1 denotes column, 0 denotes row);
# as Attrition is the target variable, X keeps only the independent variables
Y = final_data['Attrition'] # Labels
print(type(X))
print(type(Y))
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>
# all rows are kept and there are 39 independent variables
print(X.shape)
# total rows and a single column
print(Y.shape)
(1470, 39)
(1470,)
There is a wide range across variables, such as Age (M = 36.92, SD = 9.13, range 18 to 60) and Monthly Income (M = 6,502.93, SD = 4,707.96, range 1,009 to 19,999).
df[['Age', 'MonthlyIncome']].describe()
Due to the large variation in variables for mean and standard deviation, the data was normalised, so that each variable has a mean of 0 and a variance of 1.
This is performed to ensure that there is equal importance placed on all the features.
If this is not performed, then the machine learning model would assume that the ‘MonthlyIncome’ was of more significance than the other variables such as 'Age' – simply because of its higher values.
Only numeric variables were normalised such as Age and Monthly Income.
Variables such as Gender, which were converted to 0 and 1, were excluded from normalisation because 0 stands for Female and 1 for Male (assigned alphabetically).
Dummy variables such as MaritalStatus_Single, MaritalStatus_Married and MaritalStatus_Divorced, which also hold 0 and 1 values, were excluded from normalisation because 0 stands for No and 1 for Yes.
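What the scaler computes can be sketched in plain Python: subtract each column's mean and divide by its population standard deviation, which is what scikit-learn's StandardScaler does per feature (a minimal sketch with made-up ages):

```python
import math

def standardise(values):
    """z-score a column: (x - mean) / population standard deviation (ddof = 0)."""
    mean = sum(values) / len(values)
    variance = sum((x - mean) ** 2 for x in values) / len(values)
    std = math.sqrt(variance)
    return [(x - mean) / std for x in values]

ages = [18, 30, 42, 60]        # made-up values on the Age scale
scaled = standardise(ages)

# the scaled column has mean 0 and variance 1, so Age and MonthlyIncome
# end up on the same footing regardless of their original units
print(scaled)
```

After scaling, a value of +1 simply means "one standard deviation above the column mean", whatever the original units were.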
# taking only numeric df
numeric = df[['Age',
'DistanceFromHome',
'MonthlyIncome',
'NumCompaniesWorked',
'PercentSalaryHike',
'TotalWorkingYears',
'TrainingTimesLastYear',
'WorkLifeBalance',
'YearsAtCompany',
'YearsSinceLastPromotion',
'EnvironmentSatisfaction',
'JobInvolvement',
'JobSatisfaction'
]]
# column name list
numeric_columns = ['Age',
'DistanceFromHome',
'MonthlyIncome',
'NumCompaniesWorked',
'PercentSalaryHike',
'TotalWorkingYears',
'TrainingTimesLastYear',
'WorkLifeBalance',
'YearsAtCompany',
'YearsSinceLastPromotion',
'JobInvolvement',
'JobSatisfaction']
# =============================================================================
# NORMALISING DATA
# =============================================================================
## NUMERIC VARIABLES
# creating a new numeric dataset where there are only numeric variables
# this will be used for normalisation
numeric_df = X[numeric_columns]
# normalising the numeric dataset
feature_scaler = StandardScaler()
X_scaled = feature_scaler.fit_transform(numeric_df) #this is now a normalised feature set
# creating a dataset and assigning column names to the columns
# column name was previously assigned when numeric data was decided.
X_scaled = pd.DataFrame(X_scaled, columns = numeric_columns)
## CATEGORICAL VARIABLES
# There are several non-numeric variables, that dont require normalisation
# Hence they are taken out of the dataset and merged with the normalised dataset.
# Creating the non-numeric dataset
# make the list with all columns
all_columns = list(X.columns)
# keep every column that is not in the numeric list:
# the result is the list of all non-numeric (dummy/binary) columns
cat = [i for i in all_columns if i not in numeric_columns]
# creating a new dataframe with all categorical variables (no numeric variables)
cat_df = X[cat].astype('int')
## MERGING TWO DATASETS BACK TOGETHER
X_scaled = pd.concat([X_scaled, cat_df], axis=1)
# EXAMINING
# examining the first observation to ensure the merge worked correctly
print(X_scaled.head(1).transpose())
                                          0
Age                                0.446350
DistanceFromHome                  -1.010909
MonthlyIncome                     -0.108350
NumCompaniesWorked                 2.125136
PercentSalaryHike                 -1.150554
TotalWorkingYears                 -0.421642
TrainingTimesLastYear             -2.171982
WorkLifeBalance                   -2.493820
YearsAtCompany                    -0.164613
YearsSinceLastPromotion           -0.679146
JobInvolvement                     0.379672
JobSatisfaction                    1.153254
EnvironmentSatisfaction            2.000000
Gender                             0.000000
OverTime                           1.000000
RelationshipSatisfaction           1.000000
BusinessTravel_Non-Travel          0.000000
BusinessTravel_Travel_Rarely       1.000000
Department_Human_Resources         0.000000
Department_Research_&_Development  0.000000
Education_Below_College            0.000000
Education_College                  1.000000
Education_Bachelor                 0.000000
Education_Master                   0.000000
EducationField_Human_Resources     0.000000
EducationField_Life_Sciences       1.000000
EducationField_Marketing           0.000000
EducationField_Medical             0.000000
EducationField_Technical_Degree    0.000000
JobRole_Healthcare_Representative  0.000000
JobRole_Human_Resources            0.000000
JobRole_Laboratory_Technician      0.000000
JobRole_Manager                    0.000000
JobRole_Manufacturing_Director     0.000000
JobRole_Research_Director          0.000000
JobRole_Research_Scientist         0.000000
JobRole_Sales_Executive            1.000000
MaritalStatus_Married              0.000000
MaritalStatus_Divorced             0.000000
X_scaled.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 39 columns):
 #   Column                             Non-Null Count  Dtype
---  ------                             --------------  -----
 0   Age                                1470 non-null   float64
 1   DistanceFromHome                   1470 non-null   float64
 2   MonthlyIncome                      1470 non-null   float64
 3   NumCompaniesWorked                 1470 non-null   float64
 4   PercentSalaryHike                  1470 non-null   float64
 5   TotalWorkingYears                  1470 non-null   float64
 6   TrainingTimesLastYear              1470 non-null   float64
 7   WorkLifeBalance                    1470 non-null   float64
 8   YearsAtCompany                     1470 non-null   float64
 9   YearsSinceLastPromotion            1470 non-null   float64
 10  JobInvolvement                     1470 non-null   float64
 11  JobSatisfaction                    1470 non-null   float64
 12  EnvironmentSatisfaction            1470 non-null   int64
 13  Gender                             1470 non-null   int64
 14  OverTime                           1470 non-null   int64
 15  RelationshipSatisfaction           1470 non-null   int64
 16  BusinessTravel_Non-Travel          1470 non-null   int64
 17  BusinessTravel_Travel_Rarely       1470 non-null   int64
 18  Department_Human_Resources         1470 non-null   int64
 19  Department_Research_&_Development  1470 non-null   int64
 20  Education_Below_College            1470 non-null   int64
 21  Education_College                  1470 non-null   int64
 22  Education_Bachelor                 1470 non-null   int64
 23  Education_Master                   1470 non-null   int64
 24  EducationField_Human_Resources     1470 non-null   int64
 25  EducationField_Life_Sciences       1470 non-null   int64
 26  EducationField_Marketing           1470 non-null   int64
 27  EducationField_Medical             1470 non-null   int64
 28  EducationField_Technical_Degree    1470 non-null   int64
 29  JobRole_Healthcare_Representative  1470 non-null   int64
 30  JobRole_Human_Resources            1470 non-null   int64
 31  JobRole_Laboratory_Technician      1470 non-null   int64
 32  JobRole_Manager                    1470 non-null   int64
 33  JobRole_Manufacturing_Director     1470 non-null   int64
 34  JobRole_Research_Director          1470 non-null   int64
 35  JobRole_Research_Scientist         1470 non-null   int64
 36  JobRole_Sales_Executive            1470 non-null   int64
 37  MaritalStatus_Married              1470 non-null   int64
 38  MaritalStatus_Divorced             1470 non-null   int64
dtypes: float64(12), int64(27)
memory usage: 448.0 KB
# checking the df shape
print('The X_scaled has {} rows and {} columns.'.format(*X_scaled.shape))
The X_scaled has 1470 rows and 39 columns.
The dataset was split into training and test sets: 70% training and 30% test (X_train = 1029 rows, X_test = 441 rows), a typical split for this kind of analysis.
## Dividing dataset into training and test sets
X_train, X_test, Y_train, Y_test = train_test_split( X_scaled, Y,stratify = Y, test_size = 0.3, random_state = 100)
print(X_train.shape) # training set: 70% of rows, randomly selected
print(Y_test.shape)  # test set: 30% of rows, randomly selected
(1029, 39)
(441,)
The training data is imbalanced. The majority class is 0 (the employee stayed), with 863 observations in the training set; the minority class is 1 (the employee left), with 166 observations.
This means the data has to be balanced: a model built on imbalanced data would be good at predicting the majority class, but not the minority class.
It was decided to oversample, given the small number of minority-class observations, to bring the number of 'left' samples up to the level of the 'stayed' samples. A technique called SMOTE (Synthetic Minority Oversampling Technique) is used to balance the dataset: it creates artificial samples similar to existing minority-class samples and inserts them alongside them. Balancing was conducted on the training data only, not the test set: the algorithm needs to learn as much as it can from the training set, while the test data should remain imbalanced, as it almost always would be in the real world.
If undersampling had been undertaken instead, it would have reduced the majority class, leaving less for the algorithm to learn from.
Stratified random sampling was used for the train/test split because it preserves the ratio of majority to minority class found in the whole dataset. Given the small minority class, it is vital that the test set keeps the correct ratio; otherwise the models will overlearn the majority class.
The results of the oversampling show that each class is now balanced: stayed (0) = 863, left (1) = 863.
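The heart of SMOTE can be sketched in a few lines: take a minority-class observation, take one of its minority-class neighbours, and generate a synthetic point at a random position on the segment between them. This is a simplified, hypothetical sketch; the real implementation in imbalanced-learn also performs the k-nearest-neighbour search in feature space:

```python
import random

def smote_sample(point, neighbour, rng):
    """One synthetic minority sample between a point and its neighbour."""
    gap = rng.random()  # position along the segment, in [0, 1)
    return [p + gap * (n - p) for p, n in zip(point, neighbour)]

rng = random.Random(101)
left_a = [1.0, 2.0]   # a minority-class ('left') observation
left_b = [3.0, 4.0]   # one of its minority-class neighbours
synthetic = smote_sample(left_a, left_b, rng)

# every coordinate of the synthetic sample lies between the two parents
print(all(a <= s <= b for s, a, b in zip(synthetic, left_a, left_b)))  # True
```

Because the synthetic points interpolate between real minority samples, they enlarge the minority class without simply duplicating rows.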
# =============================================================================
# SMOTE - OVERSAMPLING
# =============================================================================
# Implementing Oversampling to balance the dataset;
# SMOTE stands for Synthetic Minority Oversampling TEchnique
print("Number of observations in each class before oversampling (training data): \n",
pd.Series(Y_train).value_counts())
smote = SMOTE(random_state = 101)
X_train,Y_train = smote.fit_resample(X_train,Y_train)
print("Number of observations in each class after oversampling (training data): \n",
pd.Series(Y_train).value_counts())
Number of observations in each class before oversampling (training data): 
 0    863
1    166
Name: Attrition, dtype: int64
Number of observations in each class after oversampling (training data): 
 0    863
1    863
Name: Attrition, dtype: int64
The following supervised ML classification algorithms were chosen to predict the minority class:
Feature importance: during the initial build of each ML algorithm, the model reports the importance of each variable for predicting the target. These scores will be used to rank the variables and drop the weakest as needed to increase the accuracy of the models.
Grid Search will be conducted using cross-validation, evaluating the model on a number of validation sets to ensure it behaves consistently. The training data is divided into equal folds (parts, for example 5 or 10); one fold becomes the validation set and the remaining folds the training set, and this rotates so that each fold serves as the validation set exactly once. Consistent scores across folds imply a good model. The search repeats this for every combination of hyperparameters in the grid, such as each maximum depth of the decision tree, and then picks the combination that gives the most accurate model.
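The fold rotation described above can be sketched in plain Python (a simplified sketch: scikit-learn's GridSearchCV additionally shuffles, fits and scores a model on every split for every hyperparameter combination):

```python
def k_fold_indices(n_rows, k):
    """Yield (train, test) row-index lists; every row is a test row exactly once."""
    folds = [list(range(i, n_rows, k)) for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# 10 rows, 5 folds: five splits of 8 training rows and 2 test rows
splits = list(k_fold_indices(10, 5))
print(len(splits))  # 5
print(sorted(i for _, test in splits for i in test))  # every row tested once
```

Averaging a model's score over all k splits gives a far more stable estimate than a single train/test split.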
During this project, the features below were selected because the models were built on those specific variables. When the commented-out features were manually placed into the model, they were not shown to be significant. It was therefore decided to drop those features, build the initial models, and fine-tune from the remaining features.
# dropping variables to examine which variables become more significant
features = ['Age',
'DistanceFromHome',
'MonthlyIncome',
# 'NumCompaniesWorked',
'PercentSalaryHike',
# 'TotalWorkingYears',
'TrainingTimesLastYear',
'WorkLifeBalance',
# 'YearsAtCompany',
'YearsSinceLastPromotion',
'JobInvolvement',
'JobSatisfaction',
# 'EnvironmentSatisfaction',
'Gender',
'OverTime',
# 'RelationshipSatisfaction',
'BusinessTravel_Non-Travel',
'BusinessTravel_Travel_Rarely',
'Department_Human_Resources',
# 'Department_Research_&_Development',
# 'Education_Below_College',
# 'Education_College',
# 'Education_Bachelor',
# 'Education_Master',
'EducationField_Human_Resources',
# 'EducationField_Life_Sciences',
# 'EducationField_Marketing',
'EducationField_Medical',
# 'EducationField_Technical_Degree',
'JobRole_Healthcare_Representative',
'JobRole_Human_Resources',
'JobRole_Laboratory_Technician',
# 'JobRole_Manager',
# 'JobRole_Manufacturing_Director',
# 'JobRole_Research_Director',
'JobRole_Research_Scientist',
'JobRole_Sales_Executive',
'MaritalStatus_Married',
'MaritalStatus_Divorced']
print('Number of Variables: ', len(features))
Number of Variables: 23
A Decision Tree creates a tree structure to model the relationships among the predictors and the predicted outcome (Lantz, 2015).
The 'entropy' criterion means 'information gain' is used to choose the splits. Max depth was limited to 5 levels because a deeper tree would overfit; the optimal depth is unknown, so 5 was chosen as an experiment.
Feature importance was also examined to establish the features that matter most in predicting which employees will leave. Variables with a value of 0 were not used to construct the decision tree, because only 5 levels were used.
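The entropy and information-gain calculation behind the 'entropy' criterion can be made concrete (a sketch with made-up class counts, not taken from the model):

```python
import math

def entropy(class_counts):
    """Shannon entropy in bits of a node with the given class counts."""
    total = sum(class_counts)
    return -sum((c / total) * math.log2(c / total) for c in class_counts if c > 0)

# a perfectly balanced node (8 stayed, 8 left) is maximally impure
parent = entropy([8, 8])
print(parent)  # 1.0

# a hypothetical split sends 8 rows each way, both children nearly pure
left_child, right_child = entropy([7, 1]), entropy([1, 7])
information_gain = parent - (8 / 16) * left_child - (8 / 16) * right_child
print(round(information_gain, 3))  # 0.456
```

The tree greedily picks, at each node, the split with the highest information gain, i.e. the one producing the purest children.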
# =============================================================================
# SELECTING FEATURES MENTIONED ABOVE / SMOTE / SPLITTING DATA INTO TRAINING AND TEST SETS
# =============================================================================
# Dividing dataset into label and feature sets
X = X_scaled[features].copy()
y= final_data['Attrition'] # Labels
print(type(X))
print(type(y))
print(X.shape)
print(y.shape)
# splitting data into training (80%) and test (20%) sets
X_train, X_test, Y_train, Y_test = train_test_split(X, y, stratify = y, test_size = 0.2, random_state = 42)
# Implementing Oversampling to balance the dataset; SMOTE stands for Synthetic Minority Oversampling Technique
print("Number of observations in each class before oversampling (training data): \n", pd.Series(Y_train).value_counts())
smote = SMOTE(random_state = 101)
X_train, Y_train = smote.fit_resample(X_train, Y_train)
print("Number of observations in each class after oversampling (training data): \n", pd.Series(Y_train).value_counts())
# checking that the features have been selected correctly
# examining the first observation
X.head(1).transpose()
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>
(1470, 23)
(1470,)
Number of observations in each class before oversampling (training data): 
 0    986
1    190
Name: Attrition, dtype: int64
Number of observations in each class after oversampling (training data): 
 0    986
1    986
Name: Attrition, dtype: int64
| 0 | |
|---|---|
| Age | 0.446350 |
| DistanceFromHome | -1.010909 |
| MonthlyIncome | -0.108350 |
| PercentSalaryHike | -1.150554 |
| TrainingTimesLastYear | -2.171982 |
| WorkLifeBalance | -2.493820 |
| YearsSinceLastPromotion | -0.679146 |
| JobInvolvement | 0.379672 |
| JobSatisfaction | 1.153254 |
| Gender | 0.000000 |
| OverTime | 1.000000 |
| BusinessTravel_Non-Travel | 0.000000 |
| BusinessTravel_Travel_Rarely | 1.000000 |
| Department_Human_Resources | 0.000000 |
| EducationField_Human_Resources | 0.000000 |
| EducationField_Medical | 0.000000 |
| JobRole_Healthcare_Representative | 0.000000 |
| JobRole_Human_Resources | 0.000000 |
| JobRole_Laboratory_Technician | 0.000000 |
| JobRole_Research_Scientist | 0.000000 |
| JobRole_Sales_Executive | 1.000000 |
| MaritalStatus_Married | 0.000000 |
| MaritalStatus_Divorced | 0.000000 |
# =============================================================================
# CLASSIFICATION DECISION TREE MODEL
# =============================================================================
seed = 42
dtree = tree.DecisionTreeClassifier(random_state = seed, criterion = 'entropy', max_depth = 5)
dtree.fit(X_train, Y_train)
Y_pred = dtree.predict(X_test)
print("Prediction Accuracy: ", metrics.accuracy_score(Y_test, Y_pred))
print('# ======================')
print('Confusion matrix Plot')
print('# ======================')
conf_mat = metrics.confusion_matrix(Y_test, Y_pred)
plt.figure(figsize=(8,6))
ax = sns.heatmap(conf_mat,annot = True, fmt = ".1f")
ax.set_xlabel('Predicted labels');
ax.set_ylabel('True labels');
ax.set_xticklabels(['Stayed', 'Left'])
ax.set_yticklabels(['Stayed', 'Left'])
ax.set_title("Confusion Matrix")
plt.show()
print('# ======================')
print('Confusion matrix')
print('# ======================')
print('Confusion matrix: \n', conf_mat)
TP = conf_mat[1,1] # true positive
TN = conf_mat[0,0] # true negatives
FP = conf_mat[0,1] # false positives
FN = conf_mat[1,0] # false negatives
print('TP: ', TP)
print('TN: ', TN)
print('FP: ', FP)
print('FN: ', FN)
print('# ======================')
print('Metrics Beyond Accuracy')
print('# ======================')
# Metrics beyond accuracy
#calculate sensitiviy
sensitivity = TP / float(TP + FN)
#calculate specificity
specificity = TN / float(TN + FP)
#calculate precision
precision = TP / float(TP + FP)
#calculate recall
recall = TP / float(TP + FN)
print('Sensitivity is: {:.3f}'.format(sensitivity))
print('Specificity is: {:.3f}'.format(specificity))
print('Precision is: {:.3f}'.format(precision))
print('Recall is: {:.3f}'.format(recall))
Prediction Accuracy:  0.7074829931972789
# ======================
Confusion matrix Plot
# ======================
# ======================
Confusion matrix
# ======================
Confusion matrix: 
 [[187  60]
 [ 26  21]]
TP:  21
TN:  187
FP:  60
FN:  26
# ======================
Metrics Beyond Accuracy
# ======================
Sensitivity is: 0.447
Specificity is: 0.757
Precision is: 0.259
Recall is: 0.447
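These figures can be verified by hand from the printed confusion matrix [[187 60] [26 21]], where rows are the true labels (Stayed, Left) and columns the predicted labels:

```python
# counts read off the confusion matrix above
TN, FP = 187, 60   # true Stayed: predicted Stayed / predicted Left
FN, TP = 26, 21    # true Left:   predicted Stayed / predicted Left

accuracy    = (TP + TN) / (TP + TN + FP + FN)  # 208 / 294
sensitivity = TP / (TP + FN)                   # leavers correctly caught (recall)
specificity = TN / (TN + FP)                   # stayers correctly identified
precision   = TP / (TP + FP)                   # predicted leavers who actually left

print(round(accuracy, 3), round(sensitivity, 3),
      round(specificity, 3), round(precision, 3))
# 0.707 0.447 0.757 0.259
```

Note how the 71% accuracy is driven almost entirely by the 187 correctly classified stayers; the per-class metrics reveal the weakness on leavers.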
print('# ======================')
print('Evaluation')
print('# ======================')
#use model to predict probability that given y value is 1
y_pred_proba = dtree.predict_proba(X_test)[::,1]
#calculate AUC of model
roc_auc_score = metrics.roc_auc_score(Y_test, y_pred_proba)
print('ROC AUC score: {:.2f}'.format(roc_auc_score))
print('# ======================')
print('ROC Curve')
print('# ======================')
#define metrics
fpr, tpr, _ = metrics.roc_curve(Y_test, y_pred_proba)
#create ROC curve
plt.plot(fpr,tpr,label="AUC="+str(roc_auc_score))
plt.plot([0, 1], [0, 1], color="navy", linestyle="--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.title('2-class ROC Curve')
plt.legend(loc=4)
plt.show()
# ======================
Evaluation
# ======================
ROC AUC score: 0.64
# ======================
ROC Curve
# ======================
### Precision Recall Plot
display = PrecisionRecallDisplay.from_estimator(
dtree, X_test, Y_test)
_ = display.ax_.set_title("2-class Precision-Recall curve")
plt.plot([0, 1],[1,0], color="navy", linestyle="--")
After running the initial Decision Tree model, the model's accuracy was 71%.
Precision is 26%, which means that fewer than a third of the employees predicted to leave actually left, worse than the flip of an unbiased coin. The initial DT model is therefore poor at avoiding false positives, and the fine-tuned model needs to focus on reducing them (fewer stayers incorrectly predicted as leavers).
The Area Under the Curve (AUC) is 64%, an average result for ranking true positives (employees who left) above negatives; however, the model remains weak at limiting false positives.
The precision-recall curve shows a very disappointing trade-off between precision and recall.
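The AUC of 0.64 has a direct reading: it is the probability that a randomly chosen leaver is assigned a higher predicted probability than a randomly chosen stayer. That pairwise-ranking view can be sketched with hypothetical scores (not the model's actual outputs):

```python
def auc_by_ranking(leaver_scores, stayer_scores):
    """Share of (leaver, stayer) pairs ranked correctly; ties count as half."""
    wins = sum(1.0 if pos > neg else 0.5 if pos == neg else 0.0
               for pos in leaver_scores for neg in stayer_scores)
    return wins / (len(leaver_scores) * len(stayer_scores))

# 3 of the 4 hypothetical pairs are ranked correctly
print(auc_by_ranking([0.9, 0.4], [0.5, 0.3]))  # 0.75
```

An AUC of 0.5 would mean the model ranks pairs no better than chance, so 0.64 is only a modest improvement over random ordering.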
# =============================================================================
# IMPORTANT FEATURES
# =============================================================================
featimp = pd.Series(dtree.feature_importances_, index=list(X)).sort_values(ascending=False)
print(featimp[:20])
#selecting the top 20 features
top_dtree_features = featimp[:20].sort_values()
############ CREATING PLOT ############
# Creating a plot to visualise the feature importance
sns.set()
fig, ax = plt.subplots(figsize=(12,12))
top_dtree_features.plot(kind='barh')
plt.title("Decision Tree Feature Importance", fontsize = 18)
plt.show()
MaritalStatus_Divorced               0.217353
MaritalStatus_Married                0.172947
MonthlyIncome                        0.145399
JobSatisfaction                      0.108628
Age                                  0.100918
EducationField_Medical               0.059883
DistanceFromHome                     0.038354
JobRole_Research_Scientist           0.035125
JobRole_Healthcare_Representative    0.035106
BusinessTravel_Travel_Rarely         0.024331
TrainingTimesLastYear                0.022006
YearsSinceLastPromotion              0.020778
JobInvolvement                       0.015559
OverTime                             0.003612
WorkLifeBalance                      0.000000
JobRole_Laboratory_Technician        0.000000
PercentSalaryHike                    0.000000
JobRole_Sales_Executive              0.000000
EducationField_Human_Resources       0.000000
JobRole_Human_Resources              0.000000
dtype: float64
# # =============================================================================
# # FINDING BEST PARAMETERS
# # =============================================================================
# # Define the grid of hyperparameters 'params_dt'
# dt_params = {'max_depth': [4,5,6,7,8,9,10,15,20,25,30,35],
# 'min_samples_leaf': [0.04, 0.06, 0.08],
# 'max_features': ['auto', 'sqrt', 'log2'],
# #'max_features': [0.2, 0.4,0.6, 0.8],
# 'criterion':['entropy','gini']}
# classifier = tree.DecisionTreeClassifier()
# # Instantiate a 10-fold CV grid search object 'grid_dt'
# dt_grid = GridSearchCV(estimator = classifier,
# param_grid = dt_params,
# scoring = 'precision', # to minimise false positives
# cv = 10,
# n_jobs = -1)
# # Fit 'grid_dt' to the training data
# dt_grid.fit(X_train, Y_train)
# # Extract best model from 'grid_dt'
# dt_best_model = dt_grid.best_estimator_
# # Evaluate test set accuracy
# dt_test_acc = dt_best_model.score(X_test, Y_test)
# # Print test set accuracy
# print("Test set accuracy of Decision Tree best model: {:.3f}".format(dt_test_acc))
# ## Extract best CV parameters
# dt_best_parameters = dt_grid.best_params_
# print('Best CV parameters for Decision Tree', dt_best_parameters)
# # Extract best CV score
# dt_best_result = dt_grid.best_score_
# print('Best CV score for Decision Tree: {:.3f}'.format(dt_best_result))
The following features were identified as significant from the initial Decision Tree.
These features were used to identify the best parameters for the Decision Tree.
To identify the best number of features, the lowest-scoring variables were dropped and the model was rerun and re-evaluated. The table below summarises the number of variables remaining and the accuracy and other metrics for each model. Fourteen variables were judged best suited because both accuracy and precision (reducing the number of false positives) were the highest.
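The drop-lowest-and-re-evaluate procedure described above can be sketched as a loop; this is an illustration on synthetic data (the notebook does it manually against `X_train`/`X_test`), so the feature names and cut-off of 4 remaining features are assumptions:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, precision_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in for the HR data
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(10)])
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

results = []
features = list(X.columns)
while len(features) >= 4:
    dt = DecisionTreeClassifier(random_state=42).fit(X_tr[features], y_tr)
    y_hat = dt.predict(X_te[features])
    results.append((len(features),
                    accuracy_score(y_te, y_hat),
                    precision_score(y_te, y_hat, zero_division=0)))
    # drop the least important remaining feature and re-run
    worst = pd.Series(dt.feature_importances_, index=features).idxmin()
    features.remove(worst)

summary = pd.DataFrame(results, columns=["n_features", "accuracy", "precision"])
print(summary)
```

The resulting table is the same shape as the summary table referred to above: one row per remaining-feature count, with the metrics used to pick the best model.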
After running the 10-fold CV grid search, these are the parameters that best suit the Decision Tree. Of course, if the code is rerun, it is possible that the grid search will suggest other parameters.
The Decision Tree was fine-tuned based on this screenshot.
# dropping insignificant variables
top_dtree_features_list = ['MaritalStatus_Divorced', # 0.217353
'MaritalStatus_Married', # 0.172947
'MonthlyIncome', # 0.145399
'JobSatisfaction', # 0.108628
'Age', # 0.100918
'EducationField_Medical', # 0.059883
'DistanceFromHome',# 0.038354
'JobRole_Research_Scientist', # 0.035125
'JobRole_Healthcare_Representative', # 0.035106
'BusinessTravel_Travel_Rarely', # 0.024331
'TrainingTimesLastYear', # 0.022006
'YearsSinceLastPromotion', # 0.020778
'JobInvolvement', # 0.015559
'OverTime', # 0.003612
# 'WorkLifeBalance', # 0.000000
# 'JobRole_Laboratory_Technician', # 0.000000
# 'PercentSalaryHike',# 0.000000
# 'JobRole_Sales_Executive',# 0.000000
# 'EducationField_Human_Resources',# 0.000000
# 'JobRole_Human_Resources'# 0.000000
]
print('Number of Variables: ', len(top_dtree_features_list))
# Dividing dataset into label and feature sets
X = X_scaled[top_dtree_features_list].copy()
y= final_data['Attrition'] # Labels
print(type(X))
print(type(y))
print(X.shape)
print(y.shape)
# splitting data
X_train, X_test, Y_train, Y_test = train_test_split(X, y, stratify = y, test_size = 0.2, random_state = 42)
# Implementing Oversampling to balance the dataset; SMOTE stands for Synthetic Minority Oversampling Technique
print("Number of observations in each class before oversampling (training data): \n", pd.Series(Y_train).value_counts())
smote = SMOTE(random_state = 101)
X_train, Y_train = smote.fit_resample(X_train, Y_train)
print("Number of observations in each class after oversampling (training data): \n", pd.Series(Y_train).value_counts())
Number of Variables:  14
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>
(1470, 14)
(1470,)
Number of observations in each class before oversampling (training data):
0    986
1    190
Name: Attrition, dtype: int64
Number of observations in each class after oversampling (training data):
0    986
1    986
Name: Attrition, dtype: int64
# =============================================================================
# FINE-TUNING DECISION TREE MODEL
# =============================================================================
# creating a fine-tuned model
dt_tuned = tree.DecisionTreeClassifier(random_state = seed,
criterion = 'entropy',
max_depth = 30,
max_features ='sqrt',
min_samples_leaf = 0.04)
# training the fine-tuned model
dt_tuned.fit(X_train, Y_train)
# predicting the value
dt_y_pred_tuned = dt_tuned.predict(X_test)
conf_mat = metrics.confusion_matrix(Y_test, dt_y_pred_tuned)
print('Prediction Accuracy of Decision Tree Best Model: {:.3f}'.format(metrics.accuracy_score(Y_test, dt_y_pred_tuned)))
print('# ======================')
print('Confusion matrix Plot')
print('# ======================')
conf_mat = metrics.confusion_matrix(Y_test, dt_y_pred_tuned)
plt.figure(figsize=(8,6))
ax = sns.heatmap(conf_mat,annot = True, fmt = ".1f")
ax.set_xlabel('Predicted labels');
ax.set_ylabel('True labels');
ax.set_xticklabels(['Stayed', 'Left'])
ax.set_yticklabels(['Stayed', 'Left'])
ax.set_title("Confusion Matrix")
plt.show()
print('# ======================')
print('Confusion matrix')
print('# ======================')
print('Confusion matrix: \n', conf_mat)
TP = conf_mat[1,1] # true positive
TN = conf_mat[0,0] # true negatives
FP = conf_mat[0,1] # false positives
FN = conf_mat[1,0] # false negatives
print('TP: ', TP)
print('TN: ', TN)
print('FP: ', FP)
print('FN: ', FN)
print('# ======================')
print('Metrics Beyond Accuracy')
print('# ======================')
# Metrics beyond accuracy
#calculate sensitiviy
sensitivity = TP / float(TP + FN)
#calculate specificity
specificity = TN / float(TN + FP)
#calculate precision
precision = TP / float(TP + FP)
#calculate recall
recall = TP / float(TP + FN)
print('Sensitivity is: {:.3f}'.format(sensitivity))
print('Specificity is: {:.3f}'.format(specificity))
print('Precision is: {:.3f}'.format(precision))
print('Recall is: {:.3f}'.format(recall))
Prediction Accuracy of Decision Tree Best Model: 0.769
# ======================
Confusion matrix Plot
# ======================
# ======================
Confusion matrix
# ======================
Confusion matrix:
 [[204  43]
 [ 25  22]]
TP:  22
TN:  204
FP:  43
FN:  25
# ======================
Metrics Beyond Accuracy
# ======================
Sensitivity is: 0.468
Specificity is: 0.826
Precision is: 0.338
Recall is: 0.468
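The hand-computed metrics above can be cross-checked in a single call with `classification_report`; shown here on a small hypothetical label pair, since `Y_test` and `dt_y_pred_tuned` live earlier in this notebook:

```python
import numpy as np
from sklearn.metrics import classification_report

# hypothetical labels (1 = left, 0 = stayed)
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([0, 1, 1, 0, 0, 1, 0, 0])

# per-class precision, recall, f1 and support in one table
print(classification_report(y_true, y_pred, target_names=["Stayed", "Left"]))
```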
print('# ======================')
print('Evaluation')
print('# ======================')
#use model to predict probability that given y value is 1
y_pred_proba = dt_tuned.predict_proba(X_test)[::,1]
#calculate AUC of model
roc_auc_score = metrics.roc_auc_score(Y_test, y_pred_proba)
print('ROC AUC score: {:.2f}'.format(roc_auc_score))
print('# ======================')
print('ROC Curve')
print('# ======================')
#define metrics
fpr, tpr, _ = metrics.roc_curve(Y_test, y_pred_proba)
#create ROC curve
plt.plot(fpr,tpr,label="AUC="+str(roc_auc_score))
plt.plot([0, 1], [0, 1], color="navy", linestyle="--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.title('2-class ROC Curve')
plt.legend(loc=4)
plt.show()
# ======================
Evaluation
# ======================
ROC AUC score: 0.72
# ======================
ROC Curve
# ======================
### Precision Recall Plot
display = PrecisionRecallDisplay.from_estimator(
dt_tuned,X_test, Y_test)
_ = display.ax_.set_title("2-class Precision-Recall curve")
plt.plot([0, 1],[1,0], color="navy", linestyle="--")
# =============================================================================
# IMPORTANT FEATURES
# =============================================================================
featimp = pd.Series(dt_tuned.feature_importances_, index=list(X)).sort_values(ascending=False)
print(featimp[:20])
# selecting the top features (only 14 remain in this model)
top_dtree_features = featimp[:20].sort_values()
############ CREATING PLOT ############
# Creating a plot to visualise the feature importance
sns.set()
fig, ax = plt.subplots(figsize=(12,12))
top_dtree_features.plot(kind='barh')
plt.title("Decision Tree Feature Importance", fontsize = 18)
plt.show()
MonthlyIncome                        0.240514
JobSatisfaction                      0.190978
MaritalStatus_Divorced               0.153344
EducationField_Medical               0.133980
MaritalStatus_Married                0.108915
BusinessTravel_Travel_Rarely         0.089632
DistanceFromHome                     0.044210
TrainingTimesLastYear                0.026892
OverTime                             0.010360
YearsSinceLastPromotion              0.001176
Age                                  0.000000
JobRole_Research_Scientist           0.000000
JobRole_Healthcare_Representative    0.000000
JobInvolvement                       0.000000
dtype: float64
# =============================================================================
# DECISION TREE - CROSS VALIDATION
# =============================================================================
# training the model
dt_tuned_model = dt_tuned.fit(X_train, Y_train)
# cross validation using the precision score (our aim is to reduce false positives)
dt_scores = cross_val_score(dt_tuned_model, X_train, Y_train, scoring="precision", cv = 5)
print("Mean Cross Validation score of Decision Tree: {:.3f}".format(np.mean(dt_scores)))
print("SD Cross Validation score of Decision Tree: {:.3f}".format(np.std(dt_scores)))
# original best model
print('Accuracy of the Fine-Tuned Decision Tree model (no CV) is: {:.3f}'.format(dt_tuned.score(X_test, Y_test)))
Mean Cross Validation score of Decision Tree: 0.728
SD Cross Validation score of Decision Tree: 0.029
Accuracy of the Fine-Tuned Decision Tree model (no CV) is: 0.769
Cross-validation is a good way to evaluate the accuracy of a decision tree model. The cross-validation was set to score precision because our aim is to reduce false positives. The mean cross-validation score of the decision tree is 73% (SD ± 3%), while the accuracy of the fine-tuned decision tree model without cross-validation is 77%.
Although the cross-validated score is lower, cross-validation provides a better estimate of performance because each fold is evaluated on unseen data.
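The idea can be illustrated in isolation: `cross_val_score` returns one score per fold, and the mean and spread are more informative than any single split. This is a sketch on synthetic data, not the HR dataset:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in data
X, y = make_classification(n_samples=600, random_state=1)

# five precision scores, one per held-out fold
scores = cross_val_score(DecisionTreeClassifier(random_state=1), X, y,
                         scoring="precision", cv=5)
print(f"mean={scores.mean():.3f} sd={scores.std():.3f}")
```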
After dropping different variables, the optimal number was found to be 14 variables because accuracy was highest and precision, although slightly higher, was still worse than chance.
Significant Features
These 14 variables were seen as the best predictors:
Significant features included MaritalStatus_Divorced, age, whether the employee worked overtime, whether they studied medicine, distance from home, and the number of years since the last promotion.
Accuracy
Even though several features were dropped and the best parameters for the DT were found, the accuracy of the fine-tuned decision tree model improved only slightly, to 77%.
Precision
Although precision increased to 34%, it remained below the chance of a flip of an unbiased coin, which means the fine-tuned decision tree still fails to reduce false positives (employees predicted to leave who actually stayed).
AUC Curves
The AUC is 72%, which means the model is adequate at separating leavers from stayers overall. Although the fine-tuned DT precision-recall curve is slightly better than the initial one, precision remains low at every recall level: the fine-tuned DT still cannot identify leavers (true positives) without producing many false positives (stayers predicted to leave).
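A side note, added here as an assumption rather than part of the original analysis: the "no-skill" baseline of a precision-recall curve is a horizontal line at the positive-class prevalence, not a diagonal. Using the test-set class counts from the confusion matrix above (247 stayed, 47 left):

```python
import numpy as np

# test-set labels reconstructed from the confusion-matrix counts above
y_test_counts = np.array([0] * 247 + [1] * 47)

# a classifier that flags everyone as a leaver achieves exactly this precision
baseline_precision = y_test_counts.mean()
print(f"No-skill precision baseline: {baseline_precision:.3f}")
```

Any point on the precision-recall curve above roughly 0.16 therefore already beats random guessing on this imbalanced test set.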
# =============================================================================
# Visualising the Fine-Tuned Decision Tree
# =============================================================================
# stating the features names
feature_names = X.columns
# stating the target classes
target_name = y.unique().tolist()
# converting the two class labels into strings so there is no error
target_name[0] = 'stayed'
target_name[1] = 'left'
# plotting the tree
fig = plt.figure(figsize=(25,20))
plot_tree(dt_tuned,
feature_names = feature_names,
class_names = target_name,
filled = True, rounded = True)
[plot_tree text output omitted: the fitted tree splits first on MaritalStatus_Divorced, then on JobSatisfaction, MonthlyIncome, BusinessTravel_Travel_Rarely, EducationField_Medical and DistanceFromHome.]
# from sklearn.tree import export_graphviz
# # !pip install six
# from six import StringIO
# from IPython.display import Image
# import pydotplus
# dot_data = StringIO()
# export_graphviz(dt_tuned,
# feature_names = feature_names,
# class_names = target_name,
# out_file=dot_data,
# filled=True, rounded=True,
# special_characters=True)
# graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
# # Show graph
# Image(graph.create_png())
Accuracy
The initial Decision Tree model's accuracy was 71%. After dropping features and grid-searching for the best parameters, it improved to 77%.
Precision
In the initial model, precision was 26%; it improved to 34%, which is still poor: the model remains inaccurate at identifying leavers and at reducing false positives.
Curve
The initial Area Under the Curve (AUC) was 64% and the fine-tuned model reached 72%, a very positive improvement! Unfortunately, the Precision-Recall curve failed to improve even after fine-tuning.
# =============================================================================
# Dataset with All variables
# =============================================================================
# Dividing dataset into label and feature sets
X = X_scaled[features].copy()
y= final_data['Attrition'] # Labels
print(type(X))
print(type(y))
print(X.shape)
print(y.shape)
# splitting data
X_train, X_test, Y_train, Y_test = train_test_split(X, y, stratify = y, test_size = 0.2, random_state = 42)
# Implementing Oversampling to balance the dataset; SMOTE stands for Synthetic Minority Oversampling Technique
print("Number of observations in each class before oversampling (training data): \n", pd.Series(Y_train).value_counts())
smote = SMOTE(random_state = 101)
X_train, Y_train = smote.fit_resample(X_train, Y_train)
print("Number of observations in each class after oversampling (training data): \n", pd.Series(Y_train).value_counts())
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>
(1470, 23)
(1470,)
Number of observations in each class before oversampling (training data):
0    986
1    190
Name: Attrition, dtype: int64
Number of observations in each class after oversampling (training data):
0    986
1    986
Name: Attrition, dtype: int64
# checking that all variables are present
print(X.head(1).transpose())
                                          0
Age                                0.446350
DistanceFromHome                  -1.010909
MonthlyIncome                     -0.108350
PercentSalaryHike                 -1.150554
TrainingTimesLastYear             -2.171982
WorkLifeBalance                   -2.493820
YearsSinceLastPromotion           -0.679146
JobInvolvement                     0.379672
JobSatisfaction                    1.153254
Gender                             0.000000
OverTime                           1.000000
BusinessTravel_Non-Travel          0.000000
BusinessTravel_Travel_Rarely       1.000000
Department_Human_Resources         0.000000
EducationField_Human_Resources     0.000000
EducationField_Medical             0.000000
JobRole_Healthcare_Representative  0.000000
JobRole_Human_Resources            0.000000
JobRole_Laboratory_Technician      0.000000
JobRole_Research_Scientist         0.000000
JobRole_Sales_Executive            1.000000
MaritalStatus_Married              0.000000
MaritalStatus_Divorced             0.000000
# =============================================================================
# Random Forest - First Model
# =============================================================================
rfc = RandomForestClassifier(n_estimators=300, criterion='entropy', max_features='sqrt')  # 'auto' is deprecated for classifiers; 'sqrt' is equivalent
rfc.fit(X_train,Y_train)
Y_pred = rfc.predict(X_test)
rf_accuracy = metrics.accuracy_score(Y_test, Y_pred)
print('Random Forest Prediction Accuracy: {:.3f}'.format(rf_accuracy))
print('# ======================')
print('Confusion matrix Plot')
print('# ======================')
conf_mat = metrics.confusion_matrix(Y_test, Y_pred)
plt.figure(figsize=(8,6))
ax = sns.heatmap(conf_mat,annot = True, fmt = ".1f")
ax.set_xlabel('Predicted labels');
ax.set_ylabel('True labels');
ax.set_xticklabels(['Stayed', 'Left'])
ax.set_yticklabels(['Stayed', 'Left'])
ax.set_title("Confusion Matrix")
plt.show()
print('# ======================')
print('Confusion matrix')
print('# ======================')
print('Confusion matrix: \n', conf_mat)
TP = conf_mat[1,1] # true positive
TN = conf_mat[0,0] # true negatives
FP = conf_mat[0,1] # false positives
FN = conf_mat[1,0] # false negatives
print('TP: ', TP)
print('TN: ', TN)
print('FP: ', FP)
print('FN: ', FN)
print('# ======================')
print('Metrics Beyond Accuracy')
print('# ======================')
# Metrics beyond accuracy
#calculate sensitiviy
sensitivity = TP / float(TP + FN)
#calculate specificity
specificity = TN / float(TN + FP)
#calculate precision
precision = TP / float(TP + FP)
#calculate recall
recall = TP / float(TP + FN)
print('Sensitivity is: {:.3f}'.format(sensitivity))
print('Specificity is: {:.3f}'.format(specificity))
print('Precision is: {:.3f}'.format(precision))
print('Recall is: {:.3f}'.format(recall))
Random Forest Prediction Accuracy: 0.827
# ======================
Confusion matrix Plot
# ======================
# ======================
Confusion matrix
# ======================
Confusion matrix:
 [[233  14]
 [ 37  10]]
TP:  10
TN:  233
FP:  14
FN:  37
# ======================
Metrics Beyond Accuracy
# ======================
Sensitivity is: 0.213
Specificity is: 0.943
Precision is: 0.417
Recall is: 0.213
Accuracy of the model is 83%; however, precision is only 42%. This means the model predicts the majority class (employees who stayed) well, but it is appalling at predicting the minority class (those who left).
Although the ROC AUC is 74%, the precision-recall curve is still poor, meaning the RF model still overestimates the number of people who stayed.
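One possible remedy, sketched here as an assumption rather than part of the original analysis: raise the decision threshold above the default 0.5 so that only high-confidence "leaver" predictions count, trading recall for precision. Synthetic imbalanced data stands in for the HR dataset:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# roughly 84%/16% class split, mimicking the attrition imbalance
X, y = make_classification(n_samples=1000, weights=[0.84], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
proba = rf.predict_proba(X_te)[:, 1]

for threshold in (0.5, 0.7):
    # only flag a leaver when the predicted probability clears the threshold
    y_hat = (proba >= threshold).astype(int)
    print(threshold,
          precision_score(y_te, y_hat, zero_division=0),
          recall_score(y_te, y_hat))
```

Raising the threshold can only shrink the set of predicted leavers, so false positives fall at the cost of missing some true leavers.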
print('# ======================')
print('Evaluation')
print('# ======================')
#use model to predict probability that given y value is 1
y_pred_proba = rfc.predict_proba(X_test)[::,1]
#calculate AUC of model
roc_auc_score = metrics.roc_auc_score(Y_test, y_pred_proba)
print('ROC AUC score: {:.2f}'.format(roc_auc_score))
print('# ======================')
print('ROC Curve')
print('# ======================')
#define metrics
fpr, tpr, _ = metrics.roc_curve(Y_test, y_pred_proba)
#create ROC curve
plt.plot(fpr,tpr,label="AUC="+str(roc_auc_score))
plt.plot([0, 1], [0, 1], color="navy", linestyle="--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.title('2-class ROC Curve')
plt.legend(loc=4)
plt.show()
# ======================
Evaluation
# ======================
ROC AUC score: 0.74
# ======================
ROC Curve
# ======================
### Precision Recall Plot
display = PrecisionRecallDisplay.from_estimator(
rfc, X_test, Y_test)
_ = display.ax_.set_title("2-class Precision-Recall curve")
plt.plot([0, 1],[1,0], color="navy", linestyle="--")
# =============================================================================
# IMPORTANT FEATURES
# =============================================================================
featimp = pd.Series(rfc.feature_importances_, index=list(X)).sort_values(ascending=False)
print(featimp)
# sorting all 23 features for plotting
top_dtree_features = featimp.sort_values()
# Creating a plot to visualise the feature importance
sns.set()
fig, ax = plt.subplots(figsize=(12,12))
top_dtree_features.plot(kind='barh')
plt.title("Random Forest Feature Importance", fontsize = 18)
plt.show()
MonthlyIncome                        0.112625
Age                                  0.102482
JobSatisfaction                      0.095748
DistanceFromHome                     0.083189
PercentSalaryHike                    0.069239
YearsSinceLastPromotion              0.063573
TrainingTimesLastYear                0.062618
MaritalStatus_Married                0.060272
WorkLifeBalance                      0.057658
JobInvolvement                       0.047584
MaritalStatus_Divorced               0.042701
BusinessTravel_Travel_Rarely         0.031287
EducationField_Medical               0.031150
JobRole_Research_Scientist           0.025516
BusinessTravel_Non-Travel            0.023273
OverTime                             0.020917
Gender                               0.020263
JobRole_Healthcare_Representative    0.016040
JobRole_Sales_Executive              0.013663
JobRole_Laboratory_Technician        0.012160
Department_Human_Resources           0.003110
JobRole_Human_Resources              0.003030
EducationField_Human_Resources       0.001901
dtype: float64
top_random_forest_features_list = [
'MonthlyIncome', # 0.110888
'Age' , # 0.102021
'JobSatisfaction', # 0.093081
'DistanceFromHome', # 0.084621
'PercentSalaryHike', # 0.070708
'TrainingTimesLastYear', # 0.064013
'YearsSinceLastPromotion', # 0.063710
'MaritalStatus_Married', #0.061482
'WorkLifeBalance', # 0.056613
'JobInvolvement', # 0.046655
'MaritalStatus_Divorced', # 0.044617
'EducationField_Medical', # 0.032227
'BusinessTravel_Travel_Rarely', # 0.030769
'JobRole_Research_Scientist', # 0.025797
'BusinessTravel_Non-Travel', # 0.022402
'OverTime', # 0.020679
'Gender', # 0.018865
'JobRole_Healthcare_Representative' , # 0.015195
'JobRole_Sales_Executive', # 0.014626
'JobRole_Laboratory_Technician', # 0.012497
# 'Department_Human_Resources', # 0.003529
# 'JobRole_Human_Resources', # 0.003151
# 'EducationField_Human_Resources', # 0.001854
]
print('Number of Variables selected: ', len(top_random_forest_features_list))
Number of Variables selected:  20
The following features were identified as significant from the Random Forest.
These features were used to identify the best parameters for the Random Forest.
To identify the best number of features, the lowest-scoring variables were dropped and the model was rerun and re-evaluated. The table below summarises the number of variables remaining and the accuracy and other metrics for each model. Twenty variables were judged best suited because both accuracy and precision (reducing the number of false positives) were the highest.
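The manual drop-and-re-run procedure can also be automated with scikit-learn's recursive feature elimination; this is a sketch on synthetic data (the notebook does the elimination by hand), and the column names are placeholders:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# synthetic stand-in with the same 23-feature width as the HR data
X, y = make_classification(n_samples=400, n_features=23, n_informative=6,
                           random_state=42)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(23)])

# keep the 20 strongest features, dropping the weakest one per iteration
selector = RFE(RandomForestClassifier(n_estimators=50, random_state=42),
               n_features_to_select=20, step=1).fit(X, y)
kept = X.columns[selector.support_].tolist()
print(len(kept), kept)
```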
# # =============================================================================
# # FINDING BEST PARAMETERS
# # =============================================================================
# # Tuning the random forest parameter 'n_estimators' and implementing cross-validation using Grid Search
# rf_param = {'n_estimators': [200, 250, 300, 350, 400, 450],
# 'max_features': ['auto', 'sqrt', 'log2'],
# #'max_features': [0.2, 0.4,0.6, 0.8]
# 'max_depth' : [4,5,6,7,8],
# 'criterion' : ['gini', 'entropy']}
# rf = RandomForestClassifier()
# rf_grid = GridSearchCV(estimator = rf, param_grid = rf_param, cv=5)
# rf_grid.fit(rf_X_train, rf_y_train)
# # Extract best model from rf_grid
# rf_best_model = rf_grid.best_estimator_
# # Evaluate test set accuracy from rf_grid
# rf_test_acc = rf_grid.score(rf_X_test, rf_y_test)
# # Print test set accuracy from rf_grid
# print("Test set accuracy of Random Forest best model: {:.3f}".format(rf_test_acc))
# ## Extract best CV parameters from rf_grid
# rf_best_parameters = rf_grid.best_params_
# print('Best CV parameters for Random Forest', rf_best_parameters)
# # Extract best CV score from rf_grid
# rf_best_result = rf_grid.best_score_
# print('Best CV score for Random Forest: {:.3f}'.format(rf_best_result))
After running the 5-fold CV grid search, these are the parameters that best suit the Random Forest. Of course, if the code is rerun, it is possible that the grid search will suggest other parameters.
The Random Forest was fine-tuned based on this screenshot.
# =============================================================================
# SELECTING BEST FEATURES FOR BEST RANDOM FOREST
# =============================================================================
# Dividing dataset into label and feature sets
rf_X = final_data[top_random_forest_features_list]
rf_y= final_data['Attrition'] # Labels
# splitting data
rf_X_train, rf_X_test, rf_y_train, rf_y_test = train_test_split(rf_X, rf_y, stratify = rf_y, test_size = 0.2, random_state = 42)
# Implementing Oversampling to balance the dataset; SMOTE stands for Synthetic Minority Oversampling Technique
print("Number of observations in each class before oversampling (training data): \n", pd.Series(rf_y_train).value_counts())
smote = SMOTE(random_state = 101)
rf_X_train, rf_y_train = smote.fit_resample(rf_X_train, rf_y_train)
print("Number of observations in each class after oversampling (training data): \n", pd.Series(rf_y_train).value_counts())
Number of observations in each class before oversampling (training data):
0    986
1    190
Name: Attrition, dtype: int64
Number of observations in each class after oversampling (training data):
0    986
1    986
Name: Attrition, dtype: int64
# =============================================================================
# FINE-TUNING RANDOM FOREST MODEL
# =============================================================================
# rf_y_pred_tuned = rf_best_model.predict(X_test)
rfc_tuned = RandomForestClassifier(criterion = 'gini',
max_depth = 8,
max_features ='sqrt',
n_estimators = 400,
random_state = seed)
rfc_tuned.fit(rf_X_train, rf_y_train)
rf_y_pred_tuned = rfc_tuned.predict(rf_X_test)
conf_mat = metrics.confusion_matrix(rf_y_test,rf_y_pred_tuned)
print('Prediction Accuracy of Random Forest Best Model: {:.3f}'.format(metrics.accuracy_score(rf_y_test,rf_y_pred_tuned)))
print('# ======================')
print('Confusion matrix')
print('# ======================')
conf_mat = metrics.confusion_matrix(rf_y_test, rf_y_pred_tuned)
plt.figure(figsize=(8,6))
ax = sns.heatmap(conf_mat,annot = True, fmt = ".1f")
ax.set_xlabel('Predicted labels');
ax.set_ylabel('True labels');
ax.set_xticklabels(['Stayed', 'Left'])
ax.set_yticklabels(['Stayed', 'Left'])
ax.set_title("Confusion Matrix")
plt.show()
print('# ======================')
print('Confusion matrix')
print('# ======================')
print('Confusion matrix: \n', conf_mat)
TP = conf_mat[1,1] # true positive
TN = conf_mat[0,0] # true negatives
FP = conf_mat[0,1] # false positives
FN = conf_mat[1,0] # false negatives
print('TP: ', TP)
print('TN: ', TN)
print('FP: ', FP)
print('FN: ', FN)
print('# ======================')
print('Metrics Beyond Accuracy')
print('# ======================')
# Metrics beyond accuracy
#calculate sensitiviy
sensitivity = TP / float(TP + FN)
#calculate specificity
specificity = TN / float(TN + FP)
#calculate precision
precision = TP / float(TP + FP)
#calculate recall
recall = TP / float(TP + FN)
print('Sensitivity is: {:.3f}'.format(sensitivity))
print('Specificity is: {:.3f}'.format(specificity))
print('Precision is: {:.3f}'.format(precision))
print('Recall is: {:.3f}'.format(recall))
Prediction Accuracy of Random Forest Best Model: 0.827
# ======================
Confusion matrix
# ======================
# ======================
Confusion matrix
# ======================
Confusion matrix:
 [[229  18]
 [ 33  14]]
TP:  14
TN:  229
FP:  18
FN:  33
# ======================
Metrics Beyond Accuracy
# ======================
Sensitivity is: 0.298
Specificity is: 0.927
Precision is: 0.438
Recall is: 0.298
# =============================================================================
# Random Forest - IMPORTANT FEATURES
# =============================================================================
rf_tuned_feature_importance = pd.Series(rfc_tuned.feature_importances_, index=list(rf_X)).sort_values(ascending=False)
print(rf_tuned_feature_importance)
#selecting the top features
top_rf_features = rf_tuned_feature_importance.sort_values()
############ CREATING PLOT ############
# Creating a plot to visualise the feature importance
sns.set()
fig, ax = plt.subplots(figsize=(12,12))
top_rf_features.plot(kind='barh')
plt.title("Random Forest Feature Importance", fontsize = 18)
plt.show()
MaritalStatus_Married                0.136967
MonthlyIncome                        0.122650
JobSatisfaction                      0.098977
MaritalStatus_Divorced               0.078506
JobInvolvement                       0.069839
Age                                  0.065884
WorkLifeBalance                      0.055788
DistanceFromHome                     0.049172
EducationField_Medical               0.046106
BusinessTravel_Travel_Rarely         0.038464
TrainingTimesLastYear                0.034612
BusinessTravel_Non-Travel            0.034050
JobRole_Research_Scientist           0.033871
PercentSalaryHike                    0.031364
YearsSinceLastPromotion              0.028341
JobRole_Healthcare_Representative    0.024879
OverTime                             0.015561
Gender                               0.014025
JobRole_Laboratory_Technician        0.011371
JobRole_Sales_Executive              0.009575
dtype: float64
print('# ======================')
print('Evaluation')
print('# ======================')
#use model to predict probability that given y value is 1
y_pred_proba = rfc_tuned.predict_proba(rf_X_test)[::,1]
#calculate AUC of model
roc_auc_score = metrics.roc_auc_score(rf_y_test, y_pred_proba)
print('ROC AUC score: {:.2f}'.format(roc_auc_score))
#define metrics
fpr, tpr, _ = metrics.roc_curve(rf_y_test, y_pred_proba)
#create ROC curve
plt.plot(fpr, tpr, label="AUC = {:.2f}".format(roc_auc_score))
plt.plot([0, 1], [0, 1], color="navy", linestyle="--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.title('2-class ROC Curve')
plt.legend(loc=4)
plt.show()
# ======================
Evaluation
# ======================
ROC AUC score: 0.72
### Precision Recall Plot
display = PrecisionRecallDisplay.from_estimator(
rfc_tuned, rf_X_test, rf_y_test)
_ = display.ax_.set_title("2-class Precision-Recall curve")
plt.plot([0, 1],[1,0], color="navy", linestyle="--")
After dropping different variables, the optimal number was found to be 20 variables because of the high accuracy. Unfortunately, precision remained worse than chance and, once again, we failed to meet our target of 70%.
Significant Features
These 20 variables were seen as the best predictors:
Several features that were significant in the decision tree were also significant in the random forest: marital status (divorced), age, whether the employee worked overtime, a medical education field, distance from home and the number of years since the last promotion.
Accuracy
Even though several features were dropped and the best parameters were found through the grid search, the accuracy of the fine-tuned random forest model changed only slightly, to 83%.
Precision
At 44%, precision remains below chance, the flip of an unbiased coin, which means that the fine-tuned random forest still fails to reduce False Positives (the employees who left but were predicted to stay).
AUC Curves
The AUC is 72%, which means the model is adequate at predicting True Positives (employees who stayed). Although the fine-tuned Precision-Recall curve is slightly better than the initial one, it is still below the diagonal (the 50% chance line). The fine-tuned random forest failed to adequately predict True Negatives (employees who truly left) and to reduce False Positives (employees predicted to have stayed but who really left).
# =============================================================================
# Random Forest - CROSS VALIDATION
# =============================================================================
# training the model
rfc_tuned_model = rfc_tuned.fit(rf_X_train, rf_y_train)
# cross validation scored on precision, since our aim is to reduce False Positives
rfc_scores = cross_val_score(rfc_tuned_model, rf_X_train, rf_y_train, scoring="precision", cv=5)
print("Mean Cross Validation score of Random Forest: {:.3f}".format(np.mean(rfc_scores)))
print("SD Cross Validation score of Random Forest: {:.3f}".format(np.std(rfc_scores)))
# note: .score() reports accuracy on the training data, not precision
print("Random Forest Score Without Cross Validation: {:.3f}".format(rfc_tuned.score(rf_X_train, rf_y_train)))
rf_y_pred_tuned = rfc_tuned_model.predict(rf_X_test)
# accuracy on the training data, for comparison with the cross-validation score
print('Accuracy of the Fine-Tuned Random Forest model (no CV) is: {:.3f}'.format(rfc_tuned.score(rf_X_train, rf_y_train)))
Mean Cross Validation score of Random Forest: 0.919
SD Cross Validation score of Random Forest: 0.015
Random Forest Score Without Cross Validation: 0.948
Accuracy of the Fine-Tuned Random Forest model (no CV) is: 0.948
Cross-validation is a great way to evaluate the accuracy of a random forest model. The cross-validation was set to score precision because our aim is to reduce False Positives. The mean cross-validation score of the random forest is 92% (SD ± 1.5%), while the score of the fine-tuned random forest model without cross-validation is 95%.
Although the cross-validated score is lower, it provides a better estimate of performance because it is based on unseen data.
# Extract single tree
estimator = rfc_tuned_model.estimators_[5]
# stating the features names
feature_names = rf_X_train.columns
# stating the target class names as strings; the estimator's classes are sorted,
# so class 0 maps to 'stayed' and class 1 maps to 'left'
target_name = ['stayed', 'left']
from sklearn.tree import export_graphviz
# Export as dot file
export_graphviz(estimator, out_file='tree.dot',
feature_names = feature_names,
class_names = target_name,
rounded = True, proportion = False,
precision = 2, filled = True)
# Convert to png using system command (requires Graphviz)
from subprocess import call
call(['dot', '-Tpng', 'tree.dot', '-o', 'tree.png', '-Gdpi=600'])
# Display in jupyter notebook
from IPython.display import Image
Image(filename = 'tree.png')
Accuracy
The initial Random Forest model's accuracy was 84%. After dropping features and grid searching for the best parameters, it did not improve, remaining at 83%.
In the initial model, the precision was 50%; it dropped to 44%, highlighting the model's inaccuracy at predicting True Negatives and reducing False Positives.
The Area Under the Curve (AUC) was 72% for both the initial and the fine-tuned model. Unfortunately, the Precision-Recall curve failed to improve even when the model was fine-tuned.
Random Forest was better at predicting True Positives than the Decision Tree. This is common because a Random Forest builds several decision trees and takes a majority vote across their predictions, rather than relying on a single tree.
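This voting behaviour can be sketched directly. The example below is a minimal illustration on synthetic data (not the HR dataset; the sample size and forest size are illustrative):

```python
# Sketch: a random forest predicts by aggregating the votes of all of its
# trees, not by selecting the single most accurate tree.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=42)
forest = RandomForestClassifier(n_estimators=25, random_state=42).fit(X, y)

# Each individual tree casts a vote for every sample
tree_votes = np.array([tree.predict(X) for tree in forest.estimators_])
# Majority vote across the 25 trees (odd number, so no ties)
majority_vote = (tree_votes.mean(axis=0) >= 0.5).astype(int)

# With fully grown trees, the forest's prediction matches the majority vote
agreement = (majority_vote == forest.predict(X)).mean()
print("Agreement between majority vote and forest.predict: {:.2f}".format(agreement))
```

Strictly, scikit-learn averages the trees' predicted probabilities (soft voting); with fully grown trees whose leaves are pure, this coincides with a hard majority vote.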
# =============================================================================
# Dataset with All variables
# =============================================================================
# Dividing dataset into feature and label sets
X = X_scaled[features].copy()
y = final_data['Attrition']  # labels
print(type(X))
print(type(y))
print(X.shape)
print(y.shape)
# splitting data
X_train, X_test, Y_train, Y_test = train_test_split(X, y, stratify = y, test_size = 0.2, random_state = 42)
# Implementing Oversampling to balance the dataset; SMOTE stands for Synthetic Minority Oversampling Technique
print("Number of observations in each class before oversampling (training data): \n", pd.Series(Y_train).value_counts())
smote = SMOTE(random_state = 101)
X_train, Y_train = smote.fit_resample(X_train, Y_train)
print("Number of observations in each class after oversampling (training data): \n", pd.Series(Y_train).value_counts())
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>
(1470, 23)
(1470,)
Number of observations in each class before oversampling (training data):
0    986
1    190
Name: Attrition, dtype: int64
Number of observations in each class after oversampling (training data):
0    986
1    986
Name: Attrition, dtype: int64
X_train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1972 entries, 0 to 1971
Data columns (total 23 columns):
 #   Column                             Non-Null Count  Dtype
---  ------                             --------------  -----
 0   Age                                1972 non-null   float64
 1   DistanceFromHome                   1972 non-null   float64
 2   MonthlyIncome                      1972 non-null   float64
 3   PercentSalaryHike                  1972 non-null   float64
 4   TrainingTimesLastYear              1972 non-null   float64
 5   WorkLifeBalance                    1972 non-null   float64
 6   YearsSinceLastPromotion            1972 non-null   float64
 7   JobInvolvement                     1972 non-null   float64
 8   JobSatisfaction                    1972 non-null   float64
 9   Gender                             1972 non-null   int64
 10  OverTime                           1972 non-null   int64
 11  BusinessTravel_Non-Travel          1972 non-null   int64
 12  BusinessTravel_Travel_Rarely       1972 non-null   int64
 13  Department_Human_Resources         1972 non-null   int64
 14  EducationField_Human_Resources     1972 non-null   int64
 15  EducationField_Medical             1972 non-null   int64
 16  JobRole_Healthcare_Representative  1972 non-null   int64
 17  JobRole_Human_Resources            1972 non-null   int64
 18  JobRole_Laboratory_Technician      1972 non-null   int64
 19  JobRole_Research_Scientist         1972 non-null   int64
 20  JobRole_Sales_Executive            1972 non-null   int64
 21  MaritalStatus_Married              1972 non-null   int64
 22  MaritalStatus_Divorced             1972 non-null   int64
dtypes: float64(9), int64(14)
memory usage: 354.5 KB
print(X.head(1).transpose())
                                          0
Age                                0.446350
DistanceFromHome                  -1.010909
MonthlyIncome                     -0.108350
PercentSalaryHike                 -1.150554
TrainingTimesLastYear             -2.171982
WorkLifeBalance                   -2.493820
YearsSinceLastPromotion           -0.679146
JobInvolvement                     0.379672
JobSatisfaction                    1.153254
Gender                             0.000000
OverTime                           1.000000
BusinessTravel_Non-Travel          0.000000
BusinessTravel_Travel_Rarely       1.000000
Department_Human_Resources         0.000000
EducationField_Human_Resources     0.000000
EducationField_Medical             0.000000
JobRole_Healthcare_Representative  0.000000
JobRole_Human_Resources            0.000000
JobRole_Laboratory_Technician      0.000000
JobRole_Research_Scientist         0.000000
JobRole_Sales_Executive            1.000000
MaritalStatus_Married              0.000000
MaritalStatus_Divorced             0.000000
# GLM summary used to inspect coefficient significance; note that without
# family=sm.families.Binomial() statsmodels fits a Gaussian GLM (a linear
# probability model), not a true logistic regression
logm1 = sm.GLM(Y_train, sm.add_constant(X_train))
logm1.fit().summary()
| Dep. Variable: | Attrition | No. Observations: | 1972 |
|---|---|---|---|
| Model: | GLM | Df Residuals: | 1948 |
| Model Family: | Gaussian | Df Model: | 23 |
| Link Function: | identity | Scale: | 0.13679 |
| Method: | IRLS | Log-Likelihood: | -824.65 |
| Date: | Fri, 29 Apr 2022 | Deviance: | 266.48 |
| Time: | 14:07:29 | Pearson chi2: | 266. |
| No. Iterations: | 3 | | |
| Covariance Type: | nonrobust | | |
| coef | std err | z | P>|z| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| const | 0.8594 | 0.021 | 41.521 | 0.000 | 0.819 | 0.900 |
| Age | -0.0397 | 0.010 | -3.840 | 0.000 | -0.060 | -0.019 |
| DistanceFromHome | 0.0359 | 0.009 | 4.220 | 0.000 | 0.019 | 0.053 |
| MonthlyIncome | -0.1147 | 0.013 | -8.822 | 0.000 | -0.140 | -0.089 |
| PercentSalaryHike | -0.0122 | 0.009 | -1.413 | 0.158 | -0.029 | 0.005 |
| TrainingTimesLastYear | -0.0337 | 0.009 | -3.673 | 0.000 | -0.052 | -0.016 |
| WorkLifeBalance | -0.0469 | 0.008 | -5.858 | 0.000 | -0.063 | -0.031 |
| YearsSinceLastPromotion | 0.0182 | 0.010 | 1.850 | 0.064 | -0.001 | 0.038 |
| JobInvolvement | -0.0494 | 0.008 | -5.921 | 0.000 | -0.066 | -0.033 |
| JobSatisfaction | -0.0538 | 0.009 | -6.194 | 0.000 | -0.071 | -0.037 |
| Gender | -0.0173 | 0.017 | -1.006 | 0.314 | -0.051 | 0.016 |
| OverTime | 0.1207 | 0.018 | 6.560 | 0.000 | 0.085 | 0.157 |
| BusinessTravel_Non-Travel | -0.3710 | 0.036 | -10.169 | 0.000 | -0.443 | -0.300 |
| BusinessTravel_Travel_Rarely | -0.1831 | 0.019 | -9.679 | 0.000 | -0.220 | -0.146 |
| Department_Human_Resources | -0.2299 | 0.154 | -1.489 | 0.137 | -0.532 | 0.073 |
| EducationField_Human_Resources | 0.2099 | 0.102 | 2.068 | 0.039 | 0.011 | 0.409 |
| EducationField_Medical | -0.1343 | 0.021 | -6.452 | 0.000 | -0.175 | -0.093 |
| JobRole_Healthcare_Representative | -0.3288 | 0.038 | -8.597 | 0.000 | -0.404 | -0.254 |
| JobRole_Human_Resources | -0.0849 | 0.154 | -0.550 | 0.582 | -0.387 | 0.218 |
| JobRole_Laboratory_Technician | -0.1285 | 0.026 | -4.971 | 0.000 | -0.179 | -0.078 |
| JobRole_Research_Scientist | -0.3504 | 0.029 | -12.227 | 0.000 | -0.407 | -0.294 |
| JobRole_Sales_Executive | -0.1101 | 0.023 | -4.765 | 0.000 | -0.155 | -0.065 |
| MaritalStatus_Married | -0.2878 | 0.019 | -14.760 | 0.000 | -0.326 | -0.250 |
| MaritalStatus_Divorced | -0.3322 | 0.025 | -13.135 | 0.000 | -0.382 | -0.283 |
# =============================================================================
# LOGISTIC REGRESSION
# =============================================================================
logreg = LogisticRegression(random_state=42)
# Fit logreg to the training set
logreg.fit(X_train, Y_train)
y_pred = logreg.predict(X_test)
conf_mat = metrics.confusion_matrix(Y_test, y_pred)
print("Prediction Accuracy is: {:.3f}".format(metrics.accuracy_score(Y_test, y_pred)))
print('# ======================')
print('Confusion matrix Plot')
print('# ======================')
plt.figure(figsize=(8,6))
ax = sns.heatmap(conf_mat, annot=True, fmt="d")  # "d" shows integer counts
ax.set_xlabel('Predicted labels');
ax.set_ylabel('True labels');
ax.set_xticklabels(['Stayed', 'Left'])
ax.set_yticklabels(['Stayed', 'Left'])
ax.set_title("Confusion Matrix")
plt.show()
print('Confusion matrix: \n', conf_mat)
TP = conf_mat[1,1] # true positive
TN = conf_mat[0,0] # true negatives
FP = conf_mat[0,1] # false positives
FN = conf_mat[1,0] # false negatives
print('TP: ', TP)
print('TN: ', TN)
print('FP: ', FP)
print('FN: ', FN)
# Metrics beyond accuracy
#calculate sensitiviy
sensitivity = TP / float(TP + FN)
#calculate specificity
specificity = TN / float(TN + FP)
#calculate precision
precision = TP / float(TP + FP)
#calculate recall
recall = TP / float(TP + FN)
print('Sensitivity is: {:.3f}'.format(sensitivity))
print('Specificity is: {:.3f}'.format(specificity))
print('Precision is: {:.3f}'.format(precision))
print('Recall is: {:.3f}'.format(recall))
Prediction Accuracy is: 0.779
# ======================
Confusion matrix Plot
# ======================
Confusion matrix:
 [[203  44]
 [ 21  26]]
TP:  26
TN:  203
FP:  44
FN:  21
Sensitivity is: 0.553
Specificity is: 0.822
Precision is: 0.371
Recall is: 0.553
print('# ======================')
print('AUC Curves')
print('# ======================')
#use model to predict probability that given y value is 1
y_pred_proba = logreg.predict_proba(X_test)[::,1]
#calculate AUC of model
roc_auc_score = metrics.roc_auc_score(Y_test, y_pred_proba)
print('ROC AUC score: {:.2f}'.format(roc_auc_score))
# define ROC curve inputs (reusing y_pred_proba from above)
fpr, tpr, _ = metrics.roc_curve(Y_test, y_pred_proba)
#create ROC curve
plt.plot(fpr, tpr, label="AUC = {:.2f}".format(roc_auc_score))
plt.plot([0, 1], [0, 1], color="navy", linestyle="--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.title('2-class ROC Curve')
plt.legend(loc=4)
plt.show()
# ======================
AUC Curves
# ======================
ROC AUC score: 0.72
### Precision Recall Plot
display = PrecisionRecallDisplay.from_estimator(
logreg,X_test, Y_test)
_ = display.ax_.set_title("2-class Precision-Recall curve")
plt.plot([0, 1],[1,0], color="navy", linestyle="--")
After running the initial Logistic Regression model, its accuracy was 78%.
Precision is 37%, which means that the accuracy of predicting a True Negative is less than chance, the flip of an unbiased coin. Our initial logistic regression model is therefore not good at predicting True Negatives, and the fine-tuned model needs to focus on reducing False Positives (predicting stayers instead of leavers).
The Area Under the Curve (AUC) is 72%, an average result for predicting True Positives (employees who stayed); however, the precision-recall curve shows the model is not good at predicting True Negatives and reducing False Positives.
The Precision-Recall curve shows a very disappointing outcome for recall.
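One lever a fine-tuned model could use to rebalance these errors is the decision threshold on the predicted probabilities. The sketch below (synthetic data and illustrative parameters, not the HR dataset) shows how raising the threshold typically raises precision at the cost of recall:

```python
# Sketch: varying the classification threshold of a logistic regression
# trades precision (fewer False Positives) against recall.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score

X, y = make_classification(n_samples=500, weights=[0.8, 0.2], random_state=42)
clf = LogisticRegression(max_iter=1000).fit(X, y)
proba = clf.predict_proba(X)[:, 1]  # probability of the positive class

for threshold in (0.3, 0.5, 0.7):
    y_pred = (proba >= threshold).astype(int)
    print("threshold={}: precision={:.2f}, recall={:.2f}".format(
        threshold,
        precision_score(y, y_pred, zero_division=0),
        recall_score(y, y_pred, zero_division=0)))
```

Because raising the threshold can only shrink the set of predicted positives, recall is guaranteed to be non-increasing as the threshold rises, while precision usually (though not always) improves.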
# =============================================================================
# Logistic Regression - FEATURE SELECTION
# =============================================================================
# Recursive feature elimination (RFE) method is a feature selection technique
# which removes the attributes recursively and builds the model with remaining attributes.
rfe = RFE(estimator=logreg, n_features_to_select=15)
# Using RFE for feature selection
rfe.fit(X_train,Y_train)
#Summarize the selection of the attributes
print(rfe.support_)
print(rfe.ranking_)
[False False  True False False False False  True False False  True  True
  True  True  True  True  True  True  True  True  True  True  True]
[4 5 1 9 6 3 7 1 2 8 1 1 1 1 1 1 1 1 1 1 1 1 1]
# Using RFE for feature selection
logreg = LogisticRegression()
rfe = RFE(logreg)
rfe.fit(X_train,Y_train)
RFE(estimator=LogisticRegression())
rfe.support_
array([False, False, True, False, False, False, False, False, False,
False, True, True, True, True, True, True, True, False,
False, True, False, True, True])
# creating list out of significant features from LGM
rfe_col = list(X_train.columns[rfe.support_])
for i in rfe_col:
print(i)
MonthlyIncome
OverTime
BusinessTravel_Non-Travel
BusinessTravel_Travel_Rarely
Department_Human_Resources
EducationField_Human_Resources
EducationField_Medical
JobRole_Healthcare_Representative
JobRole_Research_Scientist
MaritalStatus_Married
MaritalStatus_Divorced
# selecting all significant features from LGM
rfe_X_train = X_train[rfe_col]
rfe_X_test = X_test[rfe_col]
rfe_X_test.head()
| MonthlyIncome | OverTime | BusinessTravel_Non-Travel | BusinessTravel_Travel_Rarely | Department_Human_Resources | EducationField_Human_Resources | EducationField_Medical | JobRole_Healthcare_Representative | JobRole_Research_Scientist | MaritalStatus_Married | MaritalStatus_Divorced | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1061 | -0.949765 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 891 | -0.954440 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
| 456 | 1.073882 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 922 | 2.695731 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 69 | -0.661856 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 0 |
print('Number of Variables: ', len(rfe_col))
Number of Variables: 11
# =============================================================================
# LOGISTIC REGRESSION
# =============================================================================
logreg = LogisticRegression(random_state=42)
# Fit logreg to the training set
logreg.fit(rfe_X_train, Y_train)
y_pred = logreg.predict(rfe_X_test)
conf_mat = metrics.confusion_matrix(Y_test, y_pred)
print("Prediction Accuracy is: {:.3f}".format(metrics.accuracy_score(Y_test, y_pred)))
print('# ======================')
print('Confusion matrix Plot')
print('# ======================')
plt.figure(figsize=(8,6))
ax = sns.heatmap(conf_mat, annot=True, fmt="d")  # "d" shows integer counts
ax.set_xlabel('Predicted labels');
ax.set_ylabel('True labels');
ax.set_xticklabels(['Stayed', 'Left'])
ax.set_yticklabels(['Stayed', 'Left'])
ax.set_title("Confusion Matrix")
plt.show()
print('Confusion matrix: \n', conf_mat)
TP = conf_mat[1,1] # true positive
TN = conf_mat[0,0] # true negatives
FP = conf_mat[0,1] # false positives
FN = conf_mat[1,0] # false negatives
print('TP: ', TP)
print('TN: ', TN)
print('FP: ', FP)
print('FN: ', FN)
# Metrics beyond accuracy
#calculate sensitiviy
sensitivity = TP / float(TP + FN)
#calculate specificity
specificity = TN / float(TN + FP)
#calculate precision
precision = TP / float(TP + FP)
#calculate recall
recall = TP / float(TP + FN)
print('Sensitivity is: {:.3f}'.format(sensitivity))
print('Specificity is: {:.3f}'.format(specificity))
print('Precision is: {:.3f}'.format(precision))
print('Recall is: {:.3f}'.format(recall))
Prediction Accuracy is: 0.748
# ======================
Confusion matrix Plot
# ======================
Confusion matrix:
 [[199  48]
 [ 26  21]]
TP:  21
TN:  199
FP:  48
FN:  26
Sensitivity is: 0.447
Specificity is: 0.806
Precision is: 0.304
Recall is: 0.447
print('# ======================')
print('AUC Curves')
print('# ======================')
#use model to predict probability that given y value is 1
y_pred_proba = logreg.predict_proba(rfe_X_test)[::,1]
#calculate AUC of model
roc_auc_score = metrics.roc_auc_score(Y_test, y_pred_proba)
print('ROC AUC score: {:.2f}'.format(roc_auc_score))
# define ROC curve inputs (reusing y_pred_proba from above)
fpr, tpr, _ = metrics.roc_curve(Y_test, y_pred_proba)
#create ROC curve
plt.plot(fpr, tpr, label="AUC = {:.2f}".format(roc_auc_score))
plt.plot([0, 1], [0, 1], color="navy", linestyle="--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.title('2-class ROC Curve')
plt.legend(loc=4)
plt.show()
# ======================
AUC Curves
# ======================
ROC AUC score: 0.67
### Precision Recall Plot
display = PrecisionRecallDisplay.from_estimator(
logreg,rfe_X_test, Y_test)
_ = display.ax_.set_title("2-class Precision-Recall curve")
plt.plot([0, 1],[1,0], color="navy", linestyle="--")
We failed to reach some of our objectives. All three fine-tuned models had a high accuracy of over 75%, with RF having the highest at 83%, built with 20 variables. However, all models failed to reduce False Positives and overfit the majority class (stayers): the models overlearned the majority class in the training set.
There is some evidence that job roles, salary and working overtime increase employee turnover. It is very difficult to draw a clear line, however, because the models overlearned the majority class.
There are several limitations to this study that could explain our accuracy and high number of False Positives.
The best option is to collect more data. It is highly likely that, since the data was collected, more people have opted to leave the company and more people have been hired. A bigger dataset would result in better modelling, as the current dataset is small, with only 1,233 observations and 217 observations in the minority class (employees who left).
Due to the small dataset, the training and test sets were even smaller, even though upsampling with the SMOTE technique was used to artificially create more minority-class observations in the training set. This led to overfitting: our models predict the majority class exceptionally well, whereas the aim of our research is to predict the minority class well. Only 16% of the workforce left, meaning that 84% stayed; even if the employer does nothing to improve the workplace environment, increase salaries or help employees with their work-life balance, only 16% of employees will leave, and our models will continue to predict the majority class well.
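The effect of this imbalance on accuracy can be illustrated with a majority-class baseline. The sketch below uses synthetic labels with roughly the same 84/16 split (not the HR dataset): always predicting "stayed" already scores around 84% accuracy while recalling none of the leavers.

```python
# Sketch: a classifier that always predicts the majority class looks
# accurate on an imbalanced dataset but never detects a leaver.
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

rng = np.random.default_rng(42)
y = (rng.random(1000) < 0.16).astype(int)  # ~16% leavers, as in the dataset
X = rng.random((1000, 5))                  # features are irrelevant to the baseline

baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = baseline.predict(X)
print("Accuracy: {:.2f}".format(accuracy_score(y, y_pred)))        # roughly the majority share
print("Recall (leavers): {:.2f}".format(recall_score(y, y_pred)))  # 0.00
```

This is why accuracy alone was a misleading target here, and why precision and recall on the minority class were tracked throughout.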
There were several limitations to our models:
Other techniques could have been used to fine-tune the Decision Tree and Random Forest models, such as AdaBoost classification and Gradient Boosting, as well as Out-Of-Bag scoring and Bootstrapping.
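A hedged sketch of how those alternatives might be tried, using synthetic data and default or illustrative hyperparameters (not tuned for the HR dataset):

```python
# Sketch: boosting ensembles and the out-of-bag (OOB) estimate mentioned above.
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10,
                           weights=[0.84, 0.16], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

ada = AdaBoostClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)
gbc = GradientBoostingClassifier(random_state=42).fit(X_tr, y_tr)

# oob_score=True scores each tree on the bootstrap rows it never saw,
# giving a built-in validation estimate without a separate hold-out set
rf = RandomForestClassifier(oob_score=True, random_state=42).fit(X_tr, y_tr)

print("AdaBoost test accuracy: {:.2f}".format(ada.score(X_te, y_te)))
print("Gradient boosting test accuracy: {:.2f}".format(gbc.score(X_te, y_te)))
print("Random forest OOB score: {:.2f}".format(rf.oob_score_))
```

On the real dataset these would still need a scoring choice aligned with the project's aim (precision on the minority class) rather than plain accuracy.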
It is highly recommended to address these needs and re-evaluate the models before deployment. Many of the limitations above are commonly encountered by data analysts (Xu, Zhang and Li, 2011), and addressing them would make our models more accurate at predicting employees who leave. If these shortcomings are not addressed, an inaccurate model could be deployed, producing incorrect metrics that might be acted upon in attempts to reduce the high turnover rate, potentially damaging the reputation of the company.
Given our results, we would strongly recommend NOT deploying these models into the working environment. The next stage should be to revise and re-evaluate HR strategies, collect more data and examine the issues with the dataset.
It is important to remember that employees' motives and issues are not captured in this dataset. Even though our data visualisation analysis showed that nearly half of laboratory technicians left the workplace, it did not highlight the reasons or the employees' motives. Over half of all laboratory technicians work overtime, yet only 26% have left the workplace.
Our analysis can only lead us so far because our data might not capture all issues in the workplace. Perhaps their manager was highly controlling and created a toxic working environment that leads to employees working overtime. This might be true or false; however, it is not captured in the data, and therefore clear-cut reasons why laboratory technicians leave cannot be established. Interviews and focus groups would have to be conducted to examine the reasons why employees resign.
Our analysis highlighted a pattern and found a relationship. Unlike other problems, this one involves humans, whose emotions, personal goals and motives will have to be examined to understand the real reasons why they leave. Some models were built using the employees' marital status and found it significant: during the initial stages, when the marital-status columns were removed, the algorithms performed worse. This suggests some sort of relationship. Perhaps divorced employees wish to relocate or are having mental-health issues due to the stress of the divorce, or married people wish to spend more time with their children. It is unclear why marital status has this effect on the algorithms. We would advise that HR examines its flexibility around working from home.
Conducting exit interviews before employees leave could be a great method of highlighting the reasons why an individual has decided to resign from their post. The answers would have to be analysed using thematic analysis, which is commonly used in psychology and viewed as a scientific method for analysing interviews. It would be highly recommended to use an external HR manager to conduct the interviews so that leaving employees are more willing to be honest about their reasons.
Meanwhile, surveys are quantitative, meaning that HR managers can conduct statistical analysis in Python or Tableau to represent the data and highlight common themes among the employees who leave.
HR managers should conduct focus groups bi-annually to examine the common themes that occur in employee conversations. Once again, this should be done with an external HR manager, so that the employees are more likely to be honest. Prior to COVID-19, office perks such as free fruit and nap pods were seen as caring gestures of the FAANG companies (Facebook, Amazon, Apple, Netflix and Google) (Cassidy, 2017; Shine Workplace Wellbeing, 2019).
Byrne, B. M. (2010) Multivariate Applications Series: Structural Equation Modelling with AMOS: Basic Concepts, Applications, and Programming. 2nd edn. United States of America: Taylor and Francis Group, LLC.
George, D. and Mallery, P. (2010) SPSS for Windows Step by Step: A Simple Guide and Reference. 10th edn. Pearson, Boston.
Hair, J. F., Black, W. C., Babin, B. J., and Anderson, R. E.(2010) Multivariate Data Analysis: Overview of Multivariate Methods. 7th edn. New Jersey: Pearson Education International.
Brownlee, J. (2019) ‘How to Calculate Precision, Recall, F1, and More for Deep Learning Models’, Machine Learning Mastery. Available at: https://machinelearningmastery.com/how-to-calculate-precision-recall-f1-and-more-for-deep-learning-models/ (Accessed: 13 April 2022).
Cassidy, A. (2017) ‘Clocking off: the companies introducing nap time to the workplace’, The Guardian. Available at: https://www.theguardian.com/business-to-business/2017/dec/04/clocking-off-the-companies-introducing-nap-time-to-the-workplace (Accessed: 9 April 2022).
Kumara, V. (2020) A Guide To Understanding AdaBoost, Paperspace Blog. Available at: https://blog.paperspace.com/adaboost-optimizer/.
Kunchhal, R. (2020) Out of Bag Score | OOB Score Random Forest Machine Learning. Available at: https://www.analyticsvidhya.com/blog/2020/12/out-of-bag-oob-score-in-the-random-forest-algorithm/ (Accessed: 29 April 2022).
Miel, R. (2021) ‘Welcome to “The Great Resignation”’, Plastics News, 11 October. Available at: https://search.ebscohost.com/login.aspx?direct=true&AuthType=ip,shib&db=edsbig&AN=edsbig.A678946036&site=eds-live&scope=site&custid=s4214462 (Accessed: 22 February 2022).
Precision-Recall (no date) scikit-learn. Available at: https://scikit-learn.org/stable/auto_examples/model_selection/plot_precision_recall.html (Accessed: 13 April 2022).
Shine Workplace Wellbeing (2019) ‘Free fruit at work - advantages of offering healthy snacks to staff’, Shine Workplace Wellbeing. Available at: https://www.shineworkplacewellbeing.com/free-fruit-at-work/ (Accessed: 9 April 2022).
sklearn.metrics.f1_score (no date) scikit-learn. Available at: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html (Accessed: 13 April 2022).
Vorhies, W. (2016) CRISP DM Model [Photograph]. Available at: https://www.datasciencecentral.com/profiles/blogs/crisp-dm-a-standard-methodology-to-ensure-a-good-outcome (Accessed: 30 January 2022).
Walsh, D. (2021) ‘The Great Resignation’ hits the executive ranks, too. Available at: https://eds.s.ebscohost.com/eds/detail/detail?vid=5&sid=7331b78a-4e25-45f2-b256-ea9615b44488%40redis&bdata=JkF1dGhUeXBlPWlwLHNoaWImc2l0ZT1lZHMtbGl2ZSZzY29wZT1zaXRl#AN=edsbig.A682030477&db=edsbig (Accessed: 22 February 2022).
Xu, G., Zhang, Y. and Li, L. (2011) Web Mining and Social Networking. 1st edn. New York, NY: Springer.
24/01/2022 - 30/01/2022
Chose my dataset and examined business interests in conducting the research. Looked at the data structure, examined variables and did quick visualisations of the data with pandas.
31/01/2022 - 06/02/2022
Decided to use CRISP-DM as a framework to analyse the dataset. Read up on CRISP-DM and wrote part of the introduction.
07/02/2022 - 13/02/2022
Conducted the Business Understanding and the CRISP-DM write-up for the first part of the analysis.
14/02/2022 - 20/02/2022
Read up on Data Preparation for my dataset and thought about the data exploration techniques that I should use.
Conducted Data Preparation such as converting values to categorical variables or numeric.
Created plotly graphs to examine the distribution of numeric values.
21/02/2022 - 27/02/2022
Watched tutorials about Tableau and read up on best practices for data storytelling.
Created Tableau Dashboard to visualise the data differently.
28/02/2022 -13/03/2022
Read up on feature selection using VIF and correlations. Wrote code for VIF and correlations, examined the results and concluded accordingly. Wrote up the results.
14/03/2022 - 20/03/2022
Started exploring final data preparation such as dummy variables, SMOTE-ing and splitting the dataset into training and test sets.
21/03/2022 - 27/03/2022
Read up about Decision Trees - best practices and pitfalls.
Created Decision Tree, evaluated performance and fine-tuned DT.
28/03/2022 - 03/04/2022
Read up about Random Forest - best practices and pitfalls.
Created Random Forest, evaluated performance and fine-tuned RF.
04/04/2022 - 10/04/2022
Read up about Logistic Regression - best practices and pitfalls.
Created Logistic Regression, evaluated performance and fine-tuned the LR model.
11/04/2022 - 29/04/2022
Write-up of each algorithm, fixing the Business Understanding. Thinking about limitations and future research. Final write-up and final fine-tuning of the models.